Monitor live workers across an experiment (tmux panes or callr/cluster futures)

Single read-only entry point for inspecting workers regardless of which runner spawned them. Discovery is driven by what you pass:

experimentMonitor(ef = NULL, queue_paths = NULL, stats = FALSE)

Arguments

ef: Optional "experimentFuture" object (or list of them) whose queue_path and cores will be used for discovery. Switches the function from tmux-scan mode to queue-scan mode.
queue_paths: Optional character vector of queue .rds paths. Selects queue-scan mode without an ef handle; use it when the handle is no longer in scope (e.g. across R sessions). When queue_paths is supplied without ef, the SSH-alias probe is skipped and machine_name from the queue is used verbatim as the SSH target – which only works if the OS hostname is itself a Host entry in ~/.ssh/config or an entry in /etc/hosts.
stats: Logical. When TRUE, queries ps per worker (locally or via batched SSH) to append state, cpuAvg (percent CPU averaged over the process's lifetime – not the instantaneous rate htop shows), RAM (GB) (resident memory), availableCores (total CPUs on the node, from nproc), and total RAM (GB) (total RAM on the node, from /proc/meminfo). Default FALSE.
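
A minimal usage sketch of the three ways to call the function described above. The queue path is a hypothetical placeholder, and the calls are guarded so the snippet does nothing when the package that provides experimentMonitor() is not attached.

```r
# Hypothetical queue file; substitute the path your runner actually wrote.
queue_paths <- "path/to/queue.rds"

if (exists("experimentMonitor", mode = "function")) {
  # tmux-scan mode: no ef, no queue_paths
  panes <- experimentMonitor()

  # queue-scan mode without an ef handle (e.g. from a fresh R session)
  workers <- experimentMonitor(queue_paths = queue_paths)

  # same discovery, plus the per-worker ps statistics
  detailed <- experimentMonitor(queue_paths = queue_paths, stats = TRUE)
}
```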

Value

A data.frame whose columns depend on the discovery mode:

tmux mode – session, window, pane, pane_id, pane_ref (the "session:window.pane" string), title, node (first dash-separated token in title that matches a cluster alias from /etc/hosts; falls back to localHostLabel() when the title contains only the raw local hostname; NA if no match).
queue mode – pid, machine, started_at, log_file (NA when the worker isn't a callr::r_bg writer), queue_path, runName.

With stats = TRUE, five additional columns appear in either mode: state, cpuAvg, RAM (GB), availableCores, total RAM (GB). Returns an empty data.frame (0 rows, same columns) if no workers are found.

Details

Default (ef = NULL, queue_paths = NULL) – enumerates tmux panes via tmux -S <socket> list-panes -a across every tmux server under $TMUX_TMPDIR/tmux-<uid>/, matching the behaviour of the historical tmuxListPanes(). Per-socket failures are swallowed so that one broken socket cannot poison the rest, and the scan works outside a tmux pane and across multiple tmux servers (e.g. sessions started under different -L names). Cluster_Monitor panes are filtered out.
ef supplied (or queue_paths) – reads the rows with status == "RUNNING" from each queue file, then probes ssh <core> hostname -s once per non-local entry in ef$cores. The probe maps OS hostnames (which is what Sys.info()[["nodename"]] writes to the queue) back to SSH aliases (~/.ssh/config / /etc/hosts entries). Each PID is then verified to be alive (/proc/<pid> locally, batched ssh <alias> "[ -d /proc/<pid> ]" remotely). This is the experimentFuture() / experimentSBATCH() equivalent of the tmux pane scan: workers there don't necessarily live in a tmux pane, so the queue file is the authoritative record.
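
The local half of the queue scan can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: the queue schema is not documented here, so the pid column name and the "DONE" status value are hypothetical, and the /proc check only means something on Linux.

```r
# Hypothetical queue table. Only `status` and `machine_name` are named in the
# text above; the `pid` column name and the "DONE" value are assumptions.
queue <- data.frame(
  pid = c(Sys.getpid(), 999999999L),
  machine_name = Sys.info()[["nodename"]],
  status = c("RUNNING", "DONE"),
  stringsAsFactors = FALSE
)
queue_path <- tempfile(fileext = ".rds")
saveRDS(queue, queue_path)

# Local liveness check: a PID is alive if /proc/<pid> exists (Linux only;
# on systems without /proc this always returns FALSE).
pid_alive_local <- function(pid) {
  dir.exists(file.path("/proc", as.character(as.integer(pid))))
}

# Keep only the RUNNING rows, then verify each PID.
running <- subset(readRDS(queue_path), status == "RUNNING")
running$alive <- pid_alive_local(running$pid)
```

The remote case would replace pid_alive_local() with one batched SSH call per node, as described above.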

Either way, stats = TRUE runs the same ps -o pid=,%cpu=,rss=,state= batch (locally and via one SSH connection per remote node) to append CPU / RSS / state plus per-node nproc / total RAM.
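
A rough local sketch of such a ps batch, assuming a Unix-like system with ps on the PATH. The helper name ps_snapshot() and the intermediate rss_kib column are hypothetical. The sketch passes one -o flag per field instead of the comma-separated form, since that spelling is the most portable across ps implementations, and it assumes ps reports rss in KiB.

```r
# Query ps once for a batch of PIDs and return one row per live process.
ps_snapshot <- function(pids) {
  args <- c("-o", "pid=", "-o", "%cpu=", "-o", "rss=", "-o", "state=",
            "-p", paste(as.integer(pids), collapse = ","))
  out <- if (length(pids) > 0L) {
    # ps exits non-zero when no PID matches; silence that warning.
    suppressWarnings(system2("ps", args, stdout = TRUE, stderr = FALSE))
  } else {
    character(0)
  }
  out <- out[nzchar(trimws(out))]

  if (length(out) == 0L) {
    snap <- data.frame(pid = integer(), cpuAvg = numeric(),
                       rss_kib = numeric(), state = character(),
                       stringsAsFactors = FALSE)
  } else {
    # Explicit colClasses keep a state of "T" from being read as TRUE.
    snap <- read.table(
      text = out,
      col.names = c("pid", "cpuAvg", "rss_kib", "state"),
      colClasses = c("integer", "numeric", "numeric", "character")
    )
  }
  # Resident memory, converted from KiB (assumed unit of rss).
  snap[["RAM (GB)"]] <- snap$rss_kib / 1024^2
  snap
}

# Example: inspect the current R process.
self <- ps_snapshot(Sys.getpid())
```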

State codes

The state column is the best single signal for hang-detection because it is a snapshot (no time window needed). Values:

State	Meaning
`R`	running on CPU right now
`S`	sleeping (waiting on I/O, timer, or lock)
`D`	uninterruptible sleep (usually disk I/O; persistent `D` can indicate a hang)
`T`	stopped (SIGSTOP or similar)
`Z`	zombie (dead but not yet reaped)
`Closed`	worker process has exited – PID no longer exists
`NA`	could not determine (machine unreachable, or no parseable `<node>-<pid>` in title)
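
One way to act on these codes is to map each state to a coarse verdict. The sketch below runs on a mock data.frame, not real monitor output; the helper name classify_state() and the verdict labels are illustrative choices, not part of the package.

```r
# Mock monitor output with a `state` column, as returned when stats = TRUE.
workers <- data.frame(
  pid = c(101L, 102L, 103L, 104L),
  state = c("R", "D", "Closed", NA),
  stringsAsFactors = FALSE
)

# Map each state code to a coarse verdict; R and S count as healthy.
classify_state <- function(state) {
  verdict <- rep("ok", length(state))
  verdict[state %in% "D"] <- "watch: uninterruptible sleep"
  verdict[state %in% c("T", "Z")] <- "attention: stopped or zombie"
  verdict[state %in% "Closed"] <- "exited"
  verdict[is.na(state)] <- "unknown"
  verdict
}

workers$verdict <- classify_state(workers$state)
```

Because a single D reading is usually just disk I/O, a worker is better treated as hung only when D persists across several consecutive snapshots.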
