R/experiment-family.R
experiment_family.Rd

A SpaDES "experiment" is a way of running a simulation many times with
varying inputs, parameters, paths, scenarios, or replicates. This lets
you run, for example, replication of stochastic models, hypothesis testing
with different data inputs, scenario analysis of different human decisions,
building large datasets of alternative mechanisms to enable ensemble
modeling, and other possibilities.
There are five functions for running an experiment, in two groups. The first group
(experiment() / experiment2()) is conceptually simpler: it works on
in-memory simList objects and needs no project on disk. The second group
(experimentTmux() / experimentFuture() / experimentSBATCH()) is built
around a project global.R script (typically created by setupProject())
and a shared job queue, and is what you reach for once runs are numerous,
long, or spread across machines.
setupProject (in-memory simLists)

These take simList object(s) directly, run SpaDES.core::spades() on each
via a future backend, and return the live results as a
simLists object you can post-process with
as.data.table.simLists(). Best when the run set is modest, fits in RAM, and
you want the result objects back in your session. They are not built for
resume-after-crash, cross-machine pulls, or HPC. (Moved here from the
now-unmaintained SpaDES.experiment package.)
experiment2()
The core in-memory runner: give it one or more
simLists (and optionally replicates) and it runs them all and returns
a simLists. You build the variation yourself, e.g. with
several SpaDES.core::simInit() calls.
experiment()
A light wrapper around experiment2() that builds
the variation for you: give it one base simList plus alternative
params / modules / inputs / objects and it constructs the
fully-factorial set of simLists (via factorialDesign()) and runs
them. factorialDesign() is exported separately, so the same design can
also seed the df of the second group below.
setupProject (file queue + global.R)

Here the experiment is a data.frame (or data.table) where each row
describes one set of values to be assigned to variables in the .GlobalEnv.
When run via one of these functions, the data.frame is translated into a
queue data.frame that has all the same columns and rows, plus a few more
(status, claimed_by, etc.) to coordinate the run. After creating the
queue, the function spawns a number of independent R "worker" sessions
(according to n_workers or cores). Each worker selects a single row,
assigns the values in each user-specified column to an object in the
.GlobalEnv whose name is the column name, then source()s global.R. For
example, if the data.frame has 2 rows and a column named runName with
values "trial1" and "trial2", the first worker runs
runName <- "trial1"; source(global_path) and the second runs
runName <- "trial2"; source(global_path). The status column starts as
"PENDING" for all rows; workers take the next "PENDING" (or "INTERRUPTED")
row, skipping "DONE" rows, and mark a row "DONE" when it finishes without
error before moving to the next.
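The claim, assign, source cycle described above can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: the single sequential loop, the temporary stand-in for global.R, the `results` vector, and the private environment used in place of .GlobalEnv are all assumptions made to keep the sketch self-contained.

```r
# Illustrative sketch only: one sequential "worker", no file locks.
df <- data.frame(runName = c("trial1", "trial2"), stringsAsFactors = FALSE)

# The queue keeps every user column and adds coordination columns.
queue <- df
queue$status <- "PENDING"
queue$claimed_by <- NA_character_

# Hypothetical stand-in for the project's global.R.
global_path <- tempfile(fileext = ".R")
writeLines('result <- paste("ran", runName)', global_path)

user_cols <- names(df)
results <- character(0)

repeat {
  # Take the next claimable row; DONE rows are skipped.
  i <- which(queue$status %in% c("PENDING", "INTERRUPTED"))[1]
  if (is.na(i)) break
  queue$status[i] <- "RUNNING"
  queue$claimed_by[i] <- paste0("pid-", Sys.getpid())

  # Assign each user column to a variable named after the column.
  # (The real workers use .GlobalEnv; a private environment is used here.)
  env <- new.env()
  for (col in user_cols) assign(col, queue[[col]][[i]], envir = env)

  # An error here would stop the sketch before the row is marked DONE.
  source(global_path, local = env)

  queue$status[i] <- "DONE"
  results <- c(results, get("result", envir = env))
}
```

After the loop, every row of `queue` has status "DONE" and `results` holds one value per row, in queue order.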
These three share the queue and the run-naming convention and differ only in how parallel workers are spawned:
experimentTmux()
Allows the most interactivity and so
is helpful when there is still debugging to perform. This will
only work on a computer that has tmux installed. The function
spawns one tmux pane per worker, optionally across
ssh-reachable machines. Best for interactive use where you want to
watch workers live (tmux attach). Workers can be stopped with
tmuxKillPanes().
experimentFuture()
For when little or no debugging remains. Runs the workers as
background R processes, via callr::r_bg() if all workers are local
or future::cluster if some of the workers are on different machines.
Best for stable scripts. Workers can be stopped with killExperimentFuture().
experimentSBATCH()
One Slurm batch job per worker. Best for
HPC clusters with sbatch / squeue / scancel. Block with
awaitExperimentSBATCH() (polls squeue) or stop with
killExperimentSBATCH() (graceful via stop files; force = TRUE
issues scancel). Inspect generated job scripts with
dry_run = TRUE.
All three of these accept the same core arguments:
df
The parameter grid; one row = one job. Column names become
R variables in the worker's .GlobalEnv before global.R is sourced.
global_path
Path to the R script each worker sources per job.
Must be on a filesystem visible to all workers (matters for
experimentSBATCH() and remote-host modes of the other two).
queue_path
Path to the local RDS queue file. Workers coordinate through
file-based locks on this file; remove it (or point to a fresh path) to
start over, leave it to resume.
runNameLabel
Quoted expression evaluated against each row to
derive a human-readable identifier (used in log messages, sentinel
filenames, and tmuxListPanes() output). Default is the first two
non-meta columns of the queue.
statusCalculate
Optional quoted expression that inspects the job's outputs and
returns up-to-date status / heartbeat metadata. statusCalculate_LandR and
statusCalculate_FireSenseFit are pre-built blocks for the most common
SpaDES module outputs.
ss_id
Optional Google Sheets / Drive folder ID. When provided,
workers mirror queue state to a sheet so a remote stakeholder can
watch progress in a browser. With ss_id = NULL (default) the
queue is purely local – no Google APIs are touched.
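To make "quoted expression evaluated against each row" concrete for runNameLabel, here is a minimal base-R sketch. The label expression itself and the `eval()` call are assumptions for illustration; the package may evaluate the expression differently.

```r
# Sketch: evaluate a quoted label expression against each row of df.
df <- expand.grid(.scenario = c("A", "B"), .rep = 1:2,
                  stringsAsFactors = FALSE)

# A quoted expression that refers to columns of df by name (hypothetical).
runNameLabel <- quote(paste0(.scenario, "_rep", .rep))

# Evaluate it once per row, using that row as the lookup scope.
labels <- vapply(seq_len(nrow(df)), function(i) {
  eval(runNameLabel, envir = df[i, , drop = FALSE], enclos = baseenv())
}, character(1))
```

Because the expression is quoted, nothing is evaluated until a row is available, so the same expression can be reused by every worker and by later rescans of the queue.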
A typical usage pattern:
df <- expand.grid(.scenario = c("A", "B"), .rep = 1:2,
                  stringsAsFactors = FALSE)
ef <- experimentFuture(df = df, global_path = "global.R",
                       n_workers = 2L, log_dir = "logs")

Swap experimentFuture() for experimentTmux() or experimentSBATCH()
(adjusting cores / n_workers / sbatch_opts) and the rest of the
driver script is unchanged.
Rscript -e ... per row?

At its core, that is exactly what each worker does. A worker assigns the
row's columns into .GlobalEnv and calls source("global.R"), which is
equivalent to:
Rscript -e '.ELFind <- "6.3.1"; .rep <- 1; source("global.R")'
Rscript -e '.ELFind <- "6.3.1"; .rep <- 2; source("global.R")'
When the number of sets to run is small, this works. As you add scenarios,
machines, authentication, race conditions, etc. the bookkeeping grows past what's
comfortable to maintain by hand. The experimentXXX functions are just that
bookkeeping.
The experimentXXX functions handle several problems that arise when running
scripts in parallel, including:
Two shells launched at the same second can
both pick the same row. The experimentXXX functions take an exclusive filelock lock
on the queue between read and write, so each row is claimed at most
once across all workers and machines.
If a worker dies mid-job, the row is
stuck "in progress" with no record. The experimentXXX functions mark the row RUNNING
when claimed and DONE / INTERRUPTED when finished, so the next launch
skips DONE rows and (optionally, via tmuxRefreshQueueStatus() or
experimentFutureList(kill = TRUE)) demotes orphaned RUNNING
rows back to PENDING for re-claim.
Rscript &; Rscript &; Rscript & scales as
"one process per row", which thrashes the box once you exceed the
core count. The experimentXXX functions take n_workers and let each worker pull
rows in sequence, so you cap parallelism explicitly.
Spawning N rows on each of M machines means either replicating the parameter grid by hand (and risking duplicate work) or sharding it (and losing dynamic load-balancing). With the experimentXXX functions, every worker on every machine pulls from the same queue, so a slow machine just claims fewer rows.
Rscript -e writes nothing structured – you
scrape PIDs and tail logs. The experimentXXX functions maintain a queue with
status / claimed_by / started_at / process_id / machine_name
so queueRead() gives a full snapshot, and experimentFutureList()
can enumerate live workers (and kill them) cluster-wide.
When ss_id is supplied, the
queue is mirrored to a Google Sheet a collaborator can open in a
browser; without that, "how is the run going?" requires SSH
access to the runner machine.
queueUploadMissing() / outScenarios()
anti-join the queue against the Drive upload folder so you can see
which DONE rows still need to be packaged and uploaded.
runNameLabel and statusCalculate
give one place to derive directory names and inspect output
artifacts – both per-runner and in tmuxRefreshQueueStatus() for
post-hoc rescans – without each global.R re-implementing them.
If you only ever run two rows on one machine and never restart, the two-line shell version is fine. The experimentXXX functions exist for the cases past that.
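The first item in the list above, claiming a row under an exclusive lock, can be sketched with the filelock package. This is a hedged illustration rather than the package's code: `claim_next_row()`, the `.lock` sidecar file, and the 10-second timeout are hypothetical choices; only `filelock::lock()` and `filelock::unlock()` are real API.

```r
library(filelock)  # provides lock() and unlock()

# Hypothetical helper: claim the next PENDING row of an RDS queue.
claim_next_row <- function(queue_path, worker_id) {
  # Hold an exclusive lock for the whole read-modify-write cycle.
  # timeout is in milliseconds; lock() returns NULL if it expires.
  lck <- lock(paste0(queue_path, ".lock"), exclusive = TRUE, timeout = 10000)
  if (is.null(lck)) stop("could not acquire the queue lock within 10 s")
  on.exit(unlock(lck), add = TRUE)

  queue <- readRDS(queue_path)
  i <- which(queue$status == "PENDING")[1]
  if (is.na(i)) return(NULL)          # nothing left to claim

  queue$status[i] <- "RUNNING"
  queue$claimed_by[i] <- worker_id
  saveRDS(queue, queue_path)
  queue[i, , drop = FALSE]
}

# Usage: consecutive claims never return the same row.
queue_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(.rep = 1:2, status = "PENDING",
                   claimed_by = NA_character_,
                   stringsAsFactors = FALSE), queue_path)
first  <- claim_next_row(queue_path, "worker-1")
second <- claim_next_row(queue_path, "worker-2")
third  <- claim_next_row(queue_path, "worker-3")   # queue exhausted
```

Because the read and the write happen under the same lock, two workers that start at the same instant are serialised: the second one reads the queue only after the first has recorded its claim.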
When you launch on more than one machine – experimentTmux(cores =
c("mega", "birds")) or experimentFuture(cores = c("localhost",
"camas")) – .setup_remote_machine() runs once per unique
remote host before any worker starts. It tries to make the remote R
session look enough like the local one that global.R runs the
same way. What it propagates / sets up:
SpaDES.project itself is rsynced from the local
.libPaths()[1] to the remote (or, if loaded via
devtools::load_all(), the source tree is rsynced and
R CMD INSTALL-ed). Require is version- and
RemoteSha-checked and rsynced if it's older or comes from a
different source than locally. Then Require::Install()
installs every package in SpaDES.project's Imports /
Depends / LinkingTo, plus any Suggests
installed locally (so optional runtime dependencies like
googlesheets4, cli, etc. follow along but the dev
toolchain doesn't).
terra, sf, rgdal, rgeos, lwgeom
are forced to compile from source on the remote (so they link
against the remote's libgdal.so etc., which may be a
different soversion than localhost's).
A best-effort sudo -n apt-get install -y --no-install-recommends
of the dev headers needed for the source-compiled packages
(libgdal-dev, libssl-dev, libcurl4-openssl-dev,
libxml2-dev, fonts/graphics, ...). Runs non-interactively;
if passwordless sudo isn't configured the failure is logged and
setup continues, expecting the libraries to be there already.
The remote ~/.Rprofile gets refreshed with
.libPaths(c(<local_lib>, .libPaths())),
options(repos = c("https://predictiveecology.r-universe.dev",
<local repos>)), options(defaultPackages = ...) (so the
remote uses the same minimal default-attached set as a fresh
Rscript), and Sys.setenv(CURL_CA_BUNDLE, SSL_CERT_FILE)
pointing at the system CA bundle (so libcurl can do HTTPS
even when /etc/profile.d/ isn't sourced under non-login
SSH). The remote $BASH_ENV, if set, is wrapped in a
subshell guard so a misbehaving sleep $UNSET can't kill
the SSH command shell before R starts.
The local GITHUB_PAT (read from gitcreds::gitcreds_get()
or a caller-supplied local_pat_file) is written to the
remote ~/.Renviron (chmod 0600) and to a per-lib file
<local_lib>/.spades_github_pat that's read at the top of
~/.Rprofile. git credential approve is also called
so command-line git on the remote authenticates the same
way. Required for pak to install private modules / dev
packages from GitHub.
The experimentXXX functions pass email + cache_path into each
worker; the worker calls
googlesheets4::gs4_auth(email = email, cache = cache_path)
non-interactively against the same cached OAuth token directory
the local session uses. The token directory itself isn't pushed
(it's expected to already exist via NFS or a prior login on the
remote); only the gargle_oauth_email /
gargle_oauth_cache options are forwarded so the same
identity is selected. If the cache isn't there, the worker prints
a gs4_auth warning and continues without GS access.
R/ folder + modules
The directory next to global.R called R/ (where
project-specific helper functions live) is rsync -a --delete-ed
to the remote, so anything global.R source()s from
R/ works there too. With copyModules = TRUE the
SpaDES module path (getOption("spades.modulePath")) is
also rsynced, so module code stays in step.
global.R and the queue .rds are scp'd into the
same path on the remote (or, if the path is already on NFS such
as /mnt/shared_cache/..., they're effectively no-ops –
same absolute path on both ends).
Net effect: global.R on camas sees the same packages at
the same versions, the same GITHUB_PAT, the same R/
helpers, the same SSL trust store, and the same Google identity as
global.R on mega. Hand-rolling all of that for each
remote machine before each run is the bulk of what makes
"Rscript -e ... on N hosts" miserable in practice; the experimentXXX functions
do it once per unique host per call.
Once experimentFuture(cores = c("localhost", "camas", "dougfir"))
is launched, the workers on camas and dougfir are no
longer reachable via local ps / tools::pskill() – they
are R processes on other machines. experimentFutureList()
is the cluster-wide handle for them. Pass it the ef object
and it will:
Read the queue file (which is the authoritative record:
every claim writes machine_name + process_id under
a filelock, so workers on every machine appear there even
when they didn't redirect their stdout to a discoverable
worker_NN.log).
Probe each entry in ef$cores once with
ssh <core> hostname -s to build a map from OS hostname
(which is what Sys.info()[["nodename"]] writes to the
queue) to the SSH alias the master used to reach it
(e.g. A159604 -> dougfir). This is needed because
ssh A159604 typically fails – only ssh dougfir
resolves via ~/.ssh/config / /etc/hosts.
For every status == "RUNNING" row, verify the worker
is actually alive: file.exists("/proc/<pid>") for the
local machine, batched ssh <alias> "[ -d /proc/<pid> ]"
for each remote machine (one SSH connection per machine).
Return a data.frame with pid, machine,
started_at, queue_path, runName for every
live worker – local and remote, in one table.
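The liveness checks in the steps above can be sketched as follows. This is a Linux-only illustration under stated assumptions: `pid_alive_local()` and `remote_probe_command()` are hypothetical helper names, the PIDs are made up, and the exact shell command the package sends over SSH may differ. The sketch only builds the remote command string; it does not open a connection.

```r
# Linux-only sketch: /proc/<pid> exists while the process is alive.
pid_alive_local <- function(pid) file.exists(file.path("/proc", as.character(pid)))

# Build one shell command per remote machine that tests all of its PIDs
# in a single SSH connection and echoes the ones that are still alive.
remote_probe_command <- function(alias, pids) {
  checks <- sprintf("[ -d /proc/%d ] && echo %d;", pids, pids)
  sprintf("ssh %s '%s true'", alias, paste(checks, collapse = " "))
}

# Map OS hostnames (as recorded in the queue) to SSH aliases.
host_to_alias <- c(A159604 = "dougfir")
cmd <- remote_probe_command(host_to_alias[["A159604"]], c(101L, 202L))
```

Batching every PID for a machine into one command keeps the cost at one SSH round trip per machine rather than one per worker.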
kill = TRUE uses the same map to send the chosen signal
(TERM default, INT or KILL on request):
tools::pskill() for local PIDs and a single batched
ssh <alias> "kill -<sig> p1 p2 ..." per remote machine.
After signalling, it polls (locally via /proc, remotely via
SSH) for up to 10 s until the workers actually exit, then runs
tmuxRefreshQueueStatus() on each unique queue file to demote the
now-orphaned RUNNING rows back to PENDING. When
ss_id was supplied to the original experimentFuture()
call, a <queue_path>.ss_id sidecar is left behind;
kill = TRUE reads it and pushes the same demotion to the
Google Sheet via .gs_demote_after_kill(), so the GS view
converges with the local queue without a separate cleanup step.
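The demotion step itself is a small table operation. The sketch below assumes a queue with `status` and `process_id` columns, as described earlier, and ignores `machine_name` for brevity (a real check has to match PIDs per machine). `demote_orphans()` is a hypothetical name, not a package function.

```r
# Sketch: demote RUNNING rows whose worker is no longer alive.
demote_orphans <- function(queue, alive_pids) {
  orphan <- queue$status == "RUNNING" & !(queue$process_id %in% alive_pids)
  queue$status[orphan] <- "PENDING"
  queue
}

queue <- data.frame(
  runName    = c("trial1", "trial2", "trial3"),
  status     = c("DONE", "RUNNING", "RUNNING"),
  process_id = c(11L, 22L, 33L),
  stringsAsFactors = FALSE
)

# Suppose only PID 33 survived the signal.
refreshed <- demote_orphans(queue, alive_pids = 33L)
```

DONE rows are never touched, so a relaunch against the refreshed queue redoes only the work that was actually interrupted.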
Three usage shapes:
experimentFutureList(ef)                                # list everything live
experimentFutureList(ef, kill = TRUE)                   # graceful TERM + queue refresh
experimentFutureList(ef, kill = TRUE, signal = "KILL")  # immediate

Across R sessions, when ef is gone, drive discovery off the
queue path directly:

experimentFutureList(queue_paths = "/mnt/shared_cache/.../future_queue.rds")

Without ef, the hostname-to-alias probe is skipped, so the
SSH check uses machine_name verbatim – which only works if
the OS hostname is itself reachable via SSH on the calling node
(i.e. it appears in ~/.ssh/config or /etc/hosts as a
Host entry). If not, you'll need to either keep ef in scope
or add the OS hostnames to your SSH config.
Concretely, the things you can do post-launch from the calling
machine without ever opening a terminal on camas /
dougfir:
See which row each remote worker is currently on.
Confirm that a remote worker actually died after a crash /
network blip (otherwise the queue would stay stuck at
RUNNING and no one would re-claim).
Send SIGTERM cluster-wide to abort an experiment
mid-run, then immediately re-launch a fixed global.R
against the same queue (any DONE rows are skipped, demoted
RUNNING rows are re-claimed).
Mirror that demotion to the Google Sheet so a stakeholder watching in a browser sees the change without needing to be told.
experimentMonitor() is the read-only entry point. Discovery
depends on what you pass:
experimentMonitor() (no args) – enumerates every
tmux pane on the calling machine across all tmux servers, same
as the historical tmuxListPanes().
experimentMonitor(ef) – queue-driven discovery
across all machines in ef$cores (with the
hostname-to-SSH-alias probe described above).
experimentMonitor(queue_paths = "...") – same as
ef mode, but for cross-session use when the ef
handle is gone.
stats = TRUE batches ps -o pid=,%cpu=,rss=,state=
(locally and via one SSH connection per remote node) to append:
state – R (running on CPU), S
(sleeping / waiting), D (uninterruptible sleep, often disk
I/O – persistent D = hang), T (stopped),
Z (zombie), Closed (R session exited but tmux pane
still open).
cpuAvg – percent CPU averaged over the process's
lifetime (note: not the instantaneous rate htop
shows).
RAM (GB) – resident memory (RSS), 1 decimal place.
availableCores – total CPUs on the node, from
nproc.
total RAM (GB) – total RAM on the node, from
/proc/meminfo.
availableCores and total RAM (GB) are constant across
all rows on the same node, so each pane's resource use is visible
relative to its node capacity. Unreachable nodes get NA for
all their rows; titles missing a parseable <node>-<pid> get
NA too – one bad pane / unreachable host doesn't poison the
rest of the table.
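As a rough illustration of how the per-worker columns can be derived, the sketch below parses output in the shape produced by `ps -o pid=,%cpu=,rss=,state=`. The function name `parse_ps()`, the column name `ram_gb`, and the sample lines are hypothetical; the sketch relies only on `ps` reporting RSS in kilobytes.

```r
# Sketch: parse `ps -o pid=,%cpu=,rss=,state=` output into a table.
parse_ps <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  fields <- strsplit(lines, "[[:space:]]+")
  data.frame(
    pid    = as.integer(vapply(fields, `[`, character(1), 1)),
    cpuAvg = as.numeric(vapply(fields, `[`, character(1), 2)),
    # ps reports RSS in kilobytes; convert to GB with one decimal place.
    ram_gb = round(as.numeric(vapply(fields, `[`, character(1), 3)) / 1024^2, 1),
    state  = substr(vapply(fields, `[`, character(1), 4), 1, 1),
    stringsAsFactors = FALSE
  )
}

# Hypothetical sample output for two workers.
sample_out <- c(" 4321 97.3 2097152 R", "4322  0.4  524288 S")
stats <- parse_ps(sample_out)
```

Parsing is kept separate from running `ps`, so the same function can handle lines collected locally or returned from one SSH call per remote node.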
Single function, three sources, same stats columns either
way – so a stakeholder running experimentMonitor(ef, stats =
TRUE) on a laptop sees the same per-worker CPU / RAM picture that
experimentMonitor(stats = TRUE) (legacy tmux mode) gives on
the master node. tmuxListPanes() is preserved as a thin alias
that calls experimentMonitor() with no ef, so older
code keeps working unchanged.
Related families:
scenario_family – canonical record for one row of df,
reversibly convertible between field values, an output directory
path, and an upload tarball filename.
queueRead() / queueUploadMissing() / outList() /
outScenarios() – helpers for queues persisted to a Google
Sheet plus a Drive upload folder, including the
queue-vs-uploads anti-join.
experimentMonitor() – read-only worker / pane lister.
With no args, scans tmux panes; with ef or
queue_paths, scans the queue file's RUNNING rows and
verifies each PID is alive (local /proc, batched SSH for
remotes). stats = TRUE adds per-worker CPU / RSS /
state and per-node nproc / total RAM via batched
ps. tmuxListPanes() is a thin alias for the no-args
form.
tmuxRefreshQueueStatus() / tmuxFindDuplicates() /
tmuxKillPanes() – operational tools that work regardless of
which runner produced the queue.
experimentFutureList() – experimentFuture-side
equivalent of tmuxListPanes(): discovers live workers
across the cluster (driven off the queue file's RUNNING rows
plus an ssh <core> hostname -s alias probe), and with
kill = TRUE sends TERM / INT / KILL
to all of them in one call (local via tools::pskill(),
remote batched per machine via SSH), then refreshes the queue
and demotes the matching Google-Sheet rows when a
<queue_path>.ss_id sidecar is present.
Both groups can make use of spades()'s events argument, which restricts the
events executed for each module (see SpaDES.core::spades()), though in different ways:
experiment() / experiment2(): pass events as a named argument;
it is forwarded to every spades() call, e.g.
experiment2(sim1, sim2, events = list(fireSpread = "init")). The
same events apply to all simulations / replicates.
experimentTmux() / experimentFuture() / experimentSBATCH():
there is no events argument because these functions do not call
spades() – your global.R does. To control events, add an events
column to df (each cell is the events spec for that row) and, inside
global.R, call spades(sim, events = events). Because each row carries
its own value, this gives per-scenario control of which events run
for any particular module – something the single shared events of the
in-memory family cannot do.
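A per-row events specification needs a list column, since each cell holds a named list rather than a scalar. The sketch below shows one way to build it in base R. The "burn" event name is hypothetical, and the final `spades()` call is left as a comment because it belongs inside your own global.R, where `sim` is whatever simList that script has built.

```r
# Sketch: a per-row events specification stored as a list column.
df <- expand.grid(.scenario = c("A", "B"), stringsAsFactors = FALSE)
df$events <- list(
  list(fireSpread = "init"),              # row 1: initialise only
  list(fireSpread = c("init", "burn"))    # row 2: "burn" is hypothetical
)

# What a worker would assign before sourcing global.R for row 2:
events <- df$events[[2]]

# Inside global.R the call would then be (not run here):
# SpaDES.core::spades(sim, events = events)
```

Extracting with `[[` returns the cell's own list, which is the form the events argument expects in the experiment2() example above.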