Experiment functions: five ways to run a SpaDES experiment

A SpaDES "experiment" runs a simulation many times with varying inputs, parameters, paths, scenarios, or replicates. Typical uses include replicating stochastic models, testing hypotheses with different data inputs, analysing scenarios of different human decisions, and building large datasets of alternative mechanisms for ensemble modeling.

Details

Five functions run an experiment; they fall into two groups. The first group (experiment() / experiment2()) is conceptually simpler: it works on in-memory simList objects and needs no project on disk. The second group (experimentTmux() / experimentFuture() / experimentSBATCH()) is built around a project global.R script (typically created by setupProject()) and a shared job queue, and is what you reach for once runs are numerous, long, or spread across machines.

Without `setupProject` (in-memory `simList`s)

These take simList object(s) directly, run SpaDES.core::spades() on each via a future backend, and return the live results as a simLists object you can post-process with as.data.table.simLists(). Best when the run set is modest, fits in RAM, and you want the result objects back in your session. They are not built for resume-after-crash, cross-machine pulls, or HPC. (Moved here from the now-unmaintained SpaDES.experiment package.)

experiment2(): The core in-memory runner: give it one or more simLists (and optionally replicates) and it runs them all and returns a simLists. You build the variation yourself, e.g. with several SpaDES.core::simInit() calls.
experiment(): A light wrapper around experiment2() that builds the variation for you: give it one base simList plus alternative params / modules / inputs / objects and it constructs the fully-factorial set of simLists (via factorialDesign()) and runs them. factorialDesign() is exported separately, so the same design can also seed the df of the second group below.

With `setupProject` (file queue + `global.R`)

Here the experiment is a data.frame (or data.table) where each row describes one set of values to be assigned to variables in the .GlobalEnv. When run via one of these functions, the data.frame is translated into a queue data.frame that has all the same columns and rows, plus a few more (status, claimed_by, etc.) to coordinate the run.

After creating the queue, the function spawns a number of independent R "worker" sessions (according to n_workers or cores). Each worker selects a single row, assigns the values in each user-specified column to an object in the .GlobalEnv whose name is the column name, then source()s global.R. For example, if the data.frame has 2 rows and a column named runName with values "trial1" and "trial2", the first worker runs runName <- "trial1"; source(global_path) and the second runs runName <- "trial2"; source(global_path).

The status column starts as "PENDING" for all rows; workers take the next "PENDING" (or "INTERRUPTED") row, skipping "DONE" rows, and mark a row "DONE" when it finishes without error before moving to the next.
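The per-row mechanics described above can be sketched in base R. This is a minimal illustration, not the package's implementation: run_row() is a hypothetical helper, the global.R here is a one-line stand-in written to a temporary file, and a private environment is used in place of .GlobalEnv so the example does not touch your session.

```r
# Stand-in for a project global.R: it only reads a variable set by the worker.
global_path <- tempfile(fileext = ".R")
writeLines('result <- paste("ran", runName)', global_path)

# The experiment: one row per job, one column per variable to assign.
df <- data.frame(runName = c("trial1", "trial2"), stringsAsFactors = FALSE)

# Hypothetical helper: what one worker does for one claimed row.
# Real workers assign into .GlobalEnv; a private environment is an assumption
# made here to keep the sketch side-effect free.
run_row <- function(row, global_path, envir = new.env(parent = globalenv())) {
  for (nm in names(row)) {
    assign(nm, row[[nm]], envir = envir)  # column name becomes the variable name
  }
  source(global_path, local = envir)      # then the script runs with those values
  envir
}

envs <- lapply(seq_len(nrow(df)), function(i) {
  run_row(df[i, , drop = FALSE], global_path)
})
results <- vapply(envs, function(e) get("result", envir = e), character(1))
print(results)
```

Each row produces its own run of the script, and the only thing that differs between runs is the set of variables assigned before source() is called.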

These three share the queue and the run-naming convention and differ only in how parallel workers are spawned:

experimentTmux(): The most interactive option, and so the most helpful while there is still debugging to do. It requires tmux to be installed. The function spawns one tmux pane per worker, optionally across ssh-reachable machines, so you can watch workers live (tmux attach). Workers can be stopped with tmuxKillPanes().
experimentFuture(): When little to no debugging is necessary, this function runs the workers as background R processes, via callr::r_bg() if all workers are local or future::cluster if some workers are on other machines. Best for stable scripts. Workers can be stopped with killExperimentFuture().
experimentSBATCH(): One Slurm batch job per worker. Best for HPC clusters with sbatch / squeue / scancel. Block with awaitExperimentSBATCH() (polls squeue) or stop with killExperimentSBATCH() (graceful via stop files; force = TRUE issues scancel). Inspect generated job scripts with dry_run = TRUE.

All three of these accept the same core arguments:

df: The parameter grid; one row = one job. Column names become R variables in the worker's .GlobalEnv before global.R is sourced.
global_path: Path to the R script each worker sources per job. Must be on a filesystem visible to all workers (matters for experimentSBATCH() and remote-host modes of the other two).
queue_path: Path to the local RDS queue file. Workers coordinate through file-based locks on this file; remove it (or point to a fresh path) to start over, leave it in place to resume.
runNameLabel: Quoted expression evaluated against each row to derive a human-readable identifier (used in log messages, sentinel filenames, and tmuxListPanes() output). Default is the first two non-meta columns of the queue.
statusCalculate: Optional quoted expression that inspects the job's outputs and returns up-to-date status / heartbeat metadata. statusCalculate_LandR and statusCalculate_FireSenseFit are pre-built blocks for the most common SpaDES module outputs.
ss_id: Optional Google Sheets / Drive folder ID. When provided, workers mirror queue state to a sheet so a remote stakeholder can watch progress in a browser. With ss_id = NULL (default) the queue is purely local – no Google APIs are touched.

A typical usage pattern:


df <- expand.grid(.scenario = c("A", "B"), .rep = 1:2,
                  stringsAsFactors = FALSE)
ef <- experimentFuture(df = df, global_path = "global.R",
                       n_workers = 2L, log_dir = "logs")

Swap experimentFuture() for experimentTmux() or experimentSBATCH() (adjusting cores / n_workers / sbatch_opts) and the rest of the driver script is unchanged.

Why not just run `Rscript -e ...` per row?

At its core, that is exactly what each worker does. A worker assigns the row's columns into .GlobalEnv and calls source("global.R"), which is equivalent to:


Rscript -e '.ELFind <- "6.3.1"; .rep <- 1; source("global.R")'
Rscript -e '.ELFind <- "6.3.1"; .rep <- 2; source("global.R")'

When the number of sets to run is small, this works. As you add scenarios, machines, authentication, protection against race conditions, and so on, the bookkeeping grows past what's comfortable to maintain by hand. The experimentXXX functions are just that bookkeeping.
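To make the equivalence concrete, the shell lines above can be generated from a data.frame. This is a sketch only: row_to_rscript() is a hypothetical helper and the package does not necessarily build command strings this way. Note that deparse() renders integer values with an L suffix (1L rather than 1), which is harmless when R evaluates the string.

```r
# One row per job; column names are the variables global.R expects.
df <- data.frame(.ELFind = "6.3.1", .rep = 1:2, stringsAsFactors = FALSE)

# Hypothetical helper: turn one row into an `Rscript -e` command line.
row_to_rscript <- function(row, global_path = "global.R") {
  assigns <- vapply(names(row), function(nm) {
    paste0(nm, " <- ", deparse(row[[nm]]))   # e.g. .rep <- 1L
  }, character(1))
  expr <- paste(c(assigns, sprintf("source(%s)", deparse(global_path))),
                collapse = "; ")
  paste("Rscript -e", shQuote(expr, type = "sh"))
}

cmds <- vapply(seq_len(nrow(df)), function(i) {
  row_to_rscript(df[i, , drop = FALSE])
}, character(1))
cat(cmds, sep = "\n")
```

The strings are printed rather than executed; running them would require a real global.R in the working directory.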

experimentXXX functions deal with several issues that arise when running "parallel" scripts, including:

Concurrency control: Two shells launched at the same second can both pick the same row. The experimentXXX functions take an exclusive filelock lock on the queue between read and write, so each row is claimed at most once across all workers and machines.
Resume after crash / ctrl-C: If a worker dies mid-job, the row is stuck "in progress" with no record. The experimentXXX functions mark the row RUNNING when claimed and DONE / INTERRUPTED when finished, so the next launch skips DONE rows and (optionally, via tmuxRefreshQueueStatus() or experimentFutureList(ef, kill = TRUE)) demotes orphaned RUNNING rows back to PENDING for re-claim.
Worker-pool sizing: Rscript ... & Rscript ... & Rscript ... & scales as "one process per row", which thrashes the machine once you exceed the core count. The experimentXXX functions take n_workers and let each worker pull rows in sequence, so you cap parallelism explicitly.
Cross-machine claims: Spawning N rows on each of M machines means either replicating the parameter grid by hand (and risking duplicate work) or sharding it (and losing dynamic load-balancing). With the experimentXXX functions, every worker on every machine pulls from the same queue, so a slow machine just claims fewer rows.
Live observability: Rscript -e writes nothing structured – you scrape PIDs and tail logs. The experimentXXX functions maintain a queue with status / claimed_by / started_at / process_id / machine_name so queueRead() gives a full snapshot, and experimentFutureList() can enumerate live workers (and kill them) cluster-wide.
Remote-stakeholder visibility: When ss_id is supplied, the queue is mirrored to a Google Sheet a collaborator can open in a browser; without that, "how is the run going?" requires SSH access to the runner machine.
Outputs accounting: queueUploadMissing() / outScenarios() anti-join the queue against the Drive upload folder so you can see which DONE rows still need to be packaged and uploaded.
Run-name + status hooks: runNameLabel and statusCalculate give one place to derive directory names and inspect output artifacts – both per-runner and in tmuxRefreshQueueStatus() for post-hoc rescans – without each global.R re-implementing them.
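The claim and resume behaviour in the list above can be sketched with a small RDS queue. This is an illustration under stated assumptions, not the package's code: claim_next(), demote_orphans(), and with_queue_lock() are hypothetical helpers, the ".lock" sidecar filename is an assumption, and the lock is only taken when the filelock package is installed (otherwise the sketch runs unlocked, which is only safe with a single process).

```r
queue_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(
  runName    = c("trial1", "trial2", "trial3"),
  status     = c("DONE", "RUNNING", "PENDING"),
  claimed_by = c("w1", "w-dead", NA),
  stringsAsFactors = FALSE
), queue_path)

# Hold an exclusive lock for the whole read-modify-write cycle.
with_queue_lock <- function(queue_path, fun) {
  if (!requireNamespace("filelock", quietly = TRUE)) return(fun())
  lck <- filelock::lock(paste0(queue_path, ".lock"), exclusive = TRUE)
  on.exit(filelock::unlock(lck), add = TRUE)
  fun()
}

# Claim the next PENDING / INTERRUPTED row, or return NULL if none is left.
claim_next <- function(queue_path, worker_id) {
  with_queue_lock(queue_path, function() {
    q <- readRDS(queue_path)
    i <- which(q$status %in% c("PENDING", "INTERRUPTED"))[1]
    if (is.na(i)) return(NULL)
    q$status[i] <- "RUNNING"
    q$claimed_by[i] <- worker_id
    saveRDS(q, queue_path)
    q[i, , drop = FALSE]
  })
}

# Put RUNNING rows owned by workers known to be dead back to PENDING.
demote_orphans <- function(queue_path, dead_workers) {
  with_queue_lock(queue_path, function() {
    q <- readRDS(queue_path)
    orphan <- q$status == "RUNNING" & q$claimed_by %in% dead_workers
    q$status[orphan] <- "PENDING"
    q$claimed_by[orphan] <- NA_character_
    saveRDS(q, queue_path)
    invisible(sum(orphan))
  })
}

demote_orphans(queue_path, "w-dead")
first  <- claim_next(queue_path, "w2")  # the re-queued orphan row
second <- claim_next(queue_path, "w3")  # the originally PENDING row
third  <- claim_next(queue_path, "w2")  # nothing left to claim
```

Because the read, the status change, and the write all happen inside one lock, two workers calling claim_next() at the same moment cannot both receive the same row.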

If you only ever run two rows on one machine and never restart, the two-line shell version is fine. The experimentXXX functions exist for the cases past that.

Cross-machine propagation (cluster modes)

When you launch on more than one machine – experimentTmux(cores = c("mega", "birds")) or experimentFuture(cores = c("localhost", "camas")) – .setup_remote_machine() runs once per unique remote host before any worker starts. It tries to make the remote R session look enough like the local one that global.R runs the same way. What it propagates / sets up:

Package versions: SpaDES.project itself is rsynced from the local .libPaths()[1] to the remote (or, if loaded via devtools::load_all(), the source tree is rsynced and R CMD INSTALL-ed). Require is version- and RemoteSha-checked and rsynced if it's older or comes from a different source than locally. Then Require::Install() installs every package in SpaDES.project's Imports / Depends / LinkingTo, plus any Suggests installed locally (so optional runtime dependencies like googlesheets4, cli, etc. follow along but the dev toolchain doesn't).
Compiled-from-source packages: terra, sf, rgdal, rgeos, lwgeom are forced to compile from source on the remote (so they link against the remote's libgdal.so etc., which may be a different soversion than localhost's).
System libraries: A best-effort sudo -n apt-get install -y --no-install-recommends of the dev headers needed for the source-compiled packages (libgdal-dev, libssl-dev, libcurl4-openssl-dev, libxml2-dev, fonts/graphics, ...). Runs non-interactively; if passwordless sudo isn't configured the failure is logged and setup continues, expecting the libraries to be there already.
R startup environment: The remote ~/.Rprofile gets refreshed with .libPaths(c(<local_lib>, .libPaths())), options(repos = c("https://predictiveecology.r-universe.dev", <local repos>)), options(defaultPackages = ...) (so the remote uses the same minimal default-attached set as a fresh Rscript), and Sys.setenv(CURL_CA_BUNDLE, SSL_CERT_FILE) pointing at the system CA bundle (so libcurl can do HTTPS even when /etc/profile.d/ isn't sourced under non-login SSH). The remote $BASH_ENV, if set, is wrapped in a subshell guard so a misbehaving sleep $UNSET can't kill the SSH command shell before R starts.
GitHub credentials: The local GITHUB_PAT (read from gitcreds::gitcreds_get() or a caller-supplied local_pat_file) is written to the remote ~/.Renviron (chmod 0600) and to a per-lib file <local_lib>/.spades_github_pat that's read at the top of ~/.Rprofile. git credential approve is also called so command-line git on the remote authenticates the same way. Required for pak to install private modules / dev packages from GitHub.
Google credentials: The experimentXXX functions pass email + cache_path into each worker; the worker calls googlesheets4::gs4_auth(email = email, cache = cache_path) non-interactively against the same cached OAuth token directory the local session uses. The token directory itself isn't pushed (it's expected to already exist via NFS or a prior login on the remote); only the gargle_oauth_email / gargle_oauth_cache options are forwarded so the same identity is selected. If the cache isn't there, the worker prints a gs4_auth warning and continues without GS access.
User code: R/ folder + modules: The directory next to global.R called R/ (where project-specific helper functions live) is rsync -a --delete-ed to the remote, so anything global.R source()s from R/ works there too. With copyModules = TRUE the SpaDES module path (getOption("spades.modulePath")) is also rsynced, so module code stays in step.
The job artifacts themselves: global.R and the queue .rds are scp'd into the same path on the remote. If the path is already on NFS, such as /mnt/shared_cache/..., the copy is effectively a no-op because both ends see the same absolute path.

Net effect: global.R on camas sees the same packages at the same versions, the same GITHUB_PAT, the same R/ helpers, the same SSL trust store, and the same Google identity as global.R on mega. Hand-rolling all of that for each remote machine before each run is the bulk of what makes "Rscript -e ... on N hosts" miserable in practice; the experimentXXX functions do it once per unique host per call.

Managing remote workers from the calling machine

Once experimentFuture(cores = c("localhost", "camas", "dougfir")) is launched, the workers on camas and dougfir are no longer reachable via local ps / tools::pskill() – they are R processes on other machines. experimentFutureList() is the cluster-wide handle for them. Pass it the ef object and it will:

Read the queue file (which is the authoritative record: every claim writes machine_name + process_id under a filelock, so workers on every machine appear there even when they didn't redirect their stdout to a discoverable worker_NN.log).
Probe each entry in ef$cores once with ssh <core> hostname -s to build a map from OS hostname (which is what Sys.info()[["nodename"]] writes to the queue) to the SSH alias the master used to reach it (e.g. A159604 -> dougfir). This is needed because ssh A159604 typically fails – only ssh dougfir resolves via ~/.ssh/config / /etc/hosts.
For every status == "RUNNING" row, verify the worker is actually alive: file.exists("/proc/<pid>") for the local machine, batched ssh <alias> "[ -d /proc/<pid> ]" for each remote machine (one SSH connection per machine).
Return a data.frame with pid, machine, started_at, queue_path, runName for every live worker – local and remote, in one table.
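The liveness check in the steps above can be sketched as follows. This is a hedged illustration for Linux-style /proc filesystems: pid_alive_local() and remote_check_cmd() are hypothetical helpers, and the exact shell probe the package sends over SSH is not shown in this document, so the "&& echo" form is an assumption. The remote command is only constructed, never executed.

```r
# Local check: a process is considered alive if /proc/<pid> exists (Linux).
pid_alive_local <- function(pid) {
  dir.exists(file.path("/proc", as.integer(pid)))
}

# Remote check: build ONE ssh command per machine that probes all its PIDs.
# Each live PID would be echoed back; dead PIDs print nothing.
remote_check_cmd <- function(alias, pids) {
  pids <- as.integer(pids)
  probe <- paste(sprintf("[ -d /proc/%d ] && echo %d", pids, pids),
                 collapse = "; ")
  paste("ssh", alias, shQuote(probe, type = "sh"))
}

cmd <- remote_check_cmd("dougfir", c(101L, 202L))
cat(cmd, "\n")
```

Batching all PIDs for a machine into one command is what keeps the cost at one SSH connection per machine rather than one per worker.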

kill = TRUE uses the same map to send the chosen signal (TERM default, INT or KILL on request): tools::pskill() for local PIDs and a single batched ssh <alias> "kill -<sig> p1 p2 ..." per remote machine. After signalling, it polls (locally via /proc, remotely via SSH) for up to 10 s until the workers actually exit, then runs tmuxRefreshQueueStatus() on each unique queue file to demote the now-orphaned RUNNING rows back to PENDING. When ss_id was supplied to the original experimentFuture() call, a <queue_path>.ss_id sidecar is left behind; kill = TRUE reads it and pushes the same demotion to the Google Sheet via .gs_demote_after_kill(), so the GS view converges with the local queue without a separate cleanup step.

Three usage shapes:


experimentFutureList(ef)                   # list everything live
experimentFutureList(ef, kill = TRUE)      # graceful TERM + queue refresh
experimentFutureList(ef, kill = TRUE, signal = "KILL")  # immediate

Across R sessions, when ef is gone, drive discovery off the queue path directly:


experimentFutureList(queue_paths = "/mnt/shared_cache/.../future_queue.rds")

Without ef, the hostname-to-alias probe is skipped, so the SSH check uses machine_name verbatim – which only works if the OS hostname is itself reachable via SSH from the calling node (i.e. it appears as a Host entry in ~/.ssh/config or resolves via /etc/hosts). If not, either keep ef in scope or add the OS hostnames to your SSH config.

Concretely, the things you can do post-launch from the calling machine without ever opening a terminal on camas / dougfir:

See which row each remote worker is currently on.
Confirm that a remote worker actually died after a crash / network blip (otherwise the queue would stay stuck at RUNNING and no worker would re-claim the row).
Send SIGTERM cluster-wide to abort an experiment mid-run, then immediately re-launch a fixed global.R against the same queue (any DONE rows are skipped, demoted RUNNING rows are re-claimed).
Mirror that demotion to the Google Sheet so a stakeholder watching in a browser sees the change without needing to be told.

Resource monitoring (CPU + RAM)

experimentMonitor() is the read-only entry point. Discovery depends on what you pass:

experimentMonitor() (no args) – enumerates every tmux pane on the calling machine across all tmux servers, same as the historical tmuxListPanes().
experimentMonitor(ef) – queue-driven discovery across all machines in ef$cores (with the hostname-to-SSH-alias probe described above).
experimentMonitor(queue_paths = "...") – same as ef mode, but for cross-session use when the ef handle is gone.

stats = TRUE batches ps -o pid=,%cpu=,rss=,state= (locally and via one SSH connection per remote node) to append:

state – R (running on CPU), S (sleeping / waiting), D (uninterruptible sleep, often disk I/O – persistent D = hang), T (stopped), Z (zombie), Closed (R session exited but tmux pane still open).
cpuAvg – percent CPU averaged over the process's lifetime (note: not the instantaneous rate htop shows).
RAM (GB) – resident memory (RSS), 1 decimal place.
availableCores – total CPUs on the node, from nproc.
total RAM (GB) – total RAM on the node, from /proc/meminfo.

availableCores and total RAM (GB) are constant across all rows on the same node, so each pane's resource use is visible relative to its node capacity. Unreachable nodes get NA for all their rows; titles missing a parseable <node>-<pid> get NA too – one bad pane / unreachable host doesn't poison the rest of the table.
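The per-worker columns can be derived from ps output roughly as below. This is a sketch, not the package's parser: parse_ps() is a hypothetical helper, the sample lines are made up for illustration, and RSS is assumed to be reported in KiB (as procps ps does on Linux).

```r
# Hypothetical parser for lines produced by: ps -o pid=,%cpu=,rss=,state=
parse_ps <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  fields <- strsplit(lines, "[[:space:]]+")
  field <- function(k) vapply(fields, `[`, character(1), k)
  data.frame(
    pid        = as.integer(field(1)),
    cpuAvg     = as.numeric(field(2)),                  # lifetime-average %CPU
    `RAM (GB)` = round(as.numeric(field(3)) / 1024^2, 1),  # KiB -> GiB
    state      = substr(field(4), 1, 1),                # first state letter only
    check.names = FALSE, stringsAsFactors = FALSE
  )
}

# Made-up sample output for two workers.
sample_lines <- c("  4242 97.3 2097152 R", "  4243  0.5  524288 S")
stats <- parse_ps(sample_lines)
print(stats)

# On a machine with ps available, live input could be obtained with:
# system2("ps", c("-o", "pid=,%cpu=,rss=,state=", "-p", Sys.getpid()),
#         stdout = TRUE)
```

Parsing is kept separate from the ps call so the same function can handle output collected locally or returned over SSH.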

Single function, three sources, same stats columns in every case – so a stakeholder running experimentMonitor(ef, stats = TRUE) on a laptop sees the same per-worker CPU / RAM picture that experimentMonitor(stats = TRUE) (legacy tmux mode) gives on the master node. tmuxListPanes() is preserved as a thin alias that calls experimentMonitor() with no ef, so older code keeps working unchanged.

Related families:

scenario_family – canonical record for one row of df, reversibly convertible between field values, an output directory path, and an upload tarball filename.
queueRead() / queueUploadMissing() / outList() / outScenarios() – helpers for queues persisted to a Google Sheet plus a Drive upload folder, including the queue-vs-uploads anti-join.
experimentMonitor() – read-only worker / pane lister. With no args, scans tmux panes; with ef or queue_paths, scans the queue file's RUNNING rows and verifies each PID is alive (local /proc, batched SSH for remotes). stats = TRUE adds per-worker CPU / RSS / state and per-node nproc / total RAM via batched ps. tmuxListPanes() is a thin alias for the no-args form.
tmuxRefreshQueueStatus() / tmuxFindDuplicates() / tmuxKillPanes() – operational tools that work regardless of which runner produced the queue.
experimentFutureList() – experimentFuture-side equivalent of tmuxListPanes(): discovers live workers across the cluster (driven off the queue file's RUNNING rows plus an ssh <core> hostname -s alias probe), and with kill = TRUE sends TERM / INT / KILL to all of them in one call (local via tools::pskill(), remote batched per machine via SSH), then refreshes the queue and demotes the matching Google-Sheet rows when a <queue_path>.ss_id sidecar is present.

Controlling which events run

All of these honour spades()'s events argument, which restricts the events executed for each module (see SpaDES.core::spades()):

experiment() / experiment2(): pass events as a named argument; it is forwarded to every spades() call, e.g. experiment2(sim1, sim2, events = list(fireSpread = "init")). The same events apply to all simulations / replicates.
experimentTmux() / experimentFuture() / experimentSBATCH(): there is no events argument because these functions do not call spades() – your global.R does. To control events, add an events column to df (each cell is the events spec for that row) and, inside global.R, call spades(sim, events = events). Because each row carries its own value, this gives per-scenario control of which events run for any particular module – something the single shared events of the in-memory family cannot do.
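A per-row events column might be built as a list column, as sketched below. Several things here are assumptions: the "burn" event name is invented for illustration, run_events() is a stand-in for the spades(sim, events = events) call that would live in global.R, and how the queue stores and hands list columns to workers is not described in this document, so check that against your installed version.

```r
df <- data.frame(.scenario = c("A", "B"), stringsAsFactors = FALSE)

# One events spec per row; a list column lets each cell hold a named list.
df$events <- list(
  list(fireSpread = "init"),                # scenario A: init only
  list(fireSpread = c("init", "burn"))      # scenario B: "burn" is hypothetical
)

# Stand-in for what global.R would do with the row's `events` variable.
# In a real project this would be: spades(sim, events = events)
run_events <- function(events) {
  vapply(events, paste, character(1), collapse = "+")
}

# Emulate two workers, each seeing only its own row's value.
per_row <- lapply(seq_len(nrow(df)), function(i) {
  events <- df$events[[i]]
  run_events(events)
})
print(per_row)
```

The point of the list column is that each row carries a complete events spec, so different scenarios can run different events for the same module.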