GNU Parallel: Running Many Jobs Across Multiple GPUs, CPU Cores, or Compute Nodes to Accelerate AI Workflows
GNU Parallel tutorial -- How to build an HPC cluster for AI using your local machines.
This tutorial shows how to use GNU Parallel to run, say, 100 Python or R jobs with at most N (e.g., 4) running at the same time, while cleanly passing environment variables to each .py (or .R) worker script. It also shows how to manage jobs across a multi-GPU environment where each GPU can only handle a specific number of jobs, and how to send jobs across machines via SSH under unified GNU Parallel management. Learning these principles will help you accelerate your AI workflows by more fully leveraging the compute available to you.
1. Recommended GNU Parallel Practices
Strong patterns when using GNU Parallel for batch jobs include:
- `nohup parallel ... &` for long-running background jobs
- `--joblog` for durable, machine-readable tracking of every job
- `--delay` to stagger job starts and avoid thundering-herd problems
- `seq` to generate clean numeric job sequences
- Exported environment variables passed via `--env` when workers need shared context
These techniques make workflows easier to monitor, restart, and debug.
2. Thread Control for Python
GNU Parallel’s -j4 limits the number of worker processes. It does not stop math libraries (NumPy, OpenBLAS, MKL, Accelerate, NumExpr, etc.) from spawning extra threads inside each process.
Without explicit caps, four Python jobs can easily turn into 16–32 runnable threads, causing oversubscription, sluggish performance, and thermal throttling.
Set these limits before launching (this is often the single most important macOS-specific detail):
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
3. Using Environment Variables for Clean Job Context
Keep each Python worker as stateless as practical. Pass configuration through environment variables:
- Shared values (identical for every job): export once + `--env`
- Per-job values (unique): assign inline right before calling the script
This reduces hidden shared state and makes job context explicit. Typical variables:
- `RUN_TAG` — shared batch identifier
- `JOB_ID` — unique per job
- `SLOT` — GNU Parallel slot number (1–4)
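The shared-vs-per-job split can be seen with plain shell alone (no GNU Parallel needed; `batchA` and the job number `7` are arbitrary illustration values):

```shell
# Shared value: exported once, inherited by every child process
export RUN_TAG="batchA"

# Per-job value: assigned inline, visible only to that single command
line=$(JOB_ID=7 sh -c 'echo "run_tag=$RUN_TAG job_id=$JOB_ID"')
echo "$line"

# The inline assignment does not leak back into the launching shell
echo "leaked=${JOB_ID:-unset}"
```

This is exactly the mechanism that `JOB_ID={} python3 worker.py` relies on later: GNU Parallel expands `{}`, and the shell scopes `JOB_ID` to that one worker invocation.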
4. Install GNU Parallel
brew install parallel
parallel --citation # one-time acknowledgement
parallel --number-of-cores
5. Mental Model: 100 Jobs, Maximum 4 Concurrent
seq 1 100 | parallel -j4 'python3 worker.py'
- 100 total jobs
- At most 4 running at once
- As soon as one finishes, the next queued job starts automatically
-j4 is a concurrency limit, not hard CPU pinning. macOS decides the actual core placement.
6. Essential Placeholders
| Placeholder | Meaning | Typical use |
|---|---|---|
| `{}` | The input item | `JOB_ID` |
| `{#}` | Job sequence number | overall counter |
| `{%}` | Current slot number (1 to N) | `SLOT` |
Quick demo:
seq 1 6 | parallel -j4 'echo input={} seq={#} slot={%}'
{%} is a GNU Parallel slot number, not a CPU core ID.
7. Minimal Stateless Python Worker
#!/usr/bin/env python3
import os
import time
from pathlib import Path
job_id = int(os.environ["JOB_ID"])
slot = int(os.environ["SLOT"])
run_tag = os.environ["RUN_TAG"]
Path("logs").mkdir(exist_ok=True)
print(f"start job_id={job_id} slot={slot} run_tag={run_tag}", flush=True)
# Replace with your actual work here
time.sleep(1)
with open(f"logs/{run_tag}_job_{job_id:03d}.txt", "w") as f:
    f.write(f"job_id={job_id}\nslot={slot}\nrun_tag={run_tag}\n")
print(f"done job_id={job_id} slot={slot}", flush=True)
Each worker gets its own JOB_ID and writes to its own output file, which avoids output collisions in the common case.
8. Reusable Launcher Script
Save a script like this (tailored to your task) as a `.sh` file and invoke it via the terminal (e.g., `sh launch_script.sh`).
#!/usr/bin/env bash
set -euo pipefail
mkdir -p logs
export RUN_TAG="run_2026"
# Thread caps
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
# Preview first (highly recommended)
parallel --dry-run --line-buffer --jobs 4 \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100)
# Real run in background
nohup parallel --line-buffer --jobs 4 \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100) \
> "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &
echo $! > "logs/${RUN_TAG}.pid"
Two notes:
- `set -euo pipefail` helps catch setup errors in the launcher itself, but it does not make the script fail fast on errors that happen later inside the detached `nohup ... &` job.
- Writing the PID to `logs/${RUN_TAG}.pid` makes it easier to inspect or stop the batch later.
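A sketch of the matching stop procedure, assuming a PID file like the one written above (the helper name `stop_batch` is invented for this example; GNU Parallel's exact behavior on SIGTERM varies by version, so check `man parallel` for yours):

```shell
# Read the saved PID and send SIGTERM if the process is still alive.
stop_batch() {
    pidfile="$1"
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid" && echo "sent TERM to $pid"
    else
        echo "process $pid is not running"
    fi
}

# Demo against a throwaway background process instead of a real batch:
sleep 60 &
echo $! > demo.pid
stop_batch demo.pid
```

In a real batch you would call it as `stop_batch "logs/${RUN_TAG}.pid"`, then confirm via `ps` that the workers have actually wound down.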
9. Explaining Each Flag
- `set -euo pipefail` — catches many launcher-script mistakes early
- `mkdir -p logs` — guarantees the output directory exists
- Thread exports — reduce oversubscription risk
- `--dry-run` — shows exactly what will run before committing
- `--line-buffer` — makes live output easier to read; complete lines from different jobs can still appear interleaved
- `--jobs 4` — concurrency limit
- `--joblog` — TSV record of every job (start time, runtime, exit code, command)
- `--delay 0.2` — gentle stagger
- `--env ...` — makes dependencies explicit and is especially useful for remote jobs
- `JOB_ID={} SLOT={%}` — per-job context assignment
- `nohup ... &` — lets the batch survive terminal close
10. Shared vs Per-Job Environment Variables
# Shared (export + optional --env for explicitness)
export RUN_TAG=batchA
parallel --env RUN_TAG 'python3 worker.py' ::: $(seq 1 10)
# Per-job (inline)
parallel 'JOB_ID={} python3 worker.py' ::: $(seq 1 10)
# Combined pattern
export RUN_TAG=batchA
parallel --env RUN_TAG 'JOB_ID={} SLOT={%} python3 worker.py' ::: $(seq 1 10)
11. Good Logging Habits
- Each job writes its own output file when practical
- Keep one batch-wide `.out` and `.err` file
- Always keep the `--joblog` TSV
- Save the launcher PID if you detach the batch
12. Monitoring Commands
tail -f logs/${RUN_TAG}.out
tail -n +2 logs/${RUN_TAG}_joblog.tsv | wc -l # jobs logged so far
column -t < logs/${RUN_TAG}_joblog.tsv | less -S
ps aux | grep '[p]ython3 -u ./worker.py'
cat logs/${RUN_TAG}.pid
If you need the final success/failure status of every job, treat --joblog as the source of truth.
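Current GNU Parallel joblogs are tab-separated with a header row (`Seq Host Starttime JobRuntime Send Receive Exitval Signal Command`), so the exit status is the `Exitval` column, column 7. A sketch that lists failing sequence numbers, run here against a synthetic three-row joblog so it works anywhere (verify the column position against your own header row before relying on it):

```shell
# Synthetic joblog for illustration; a real one comes from --joblog
{
  printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n'
  printf '1\t:\t0\t1.0\t0\t0\t0\t0\tpython3 worker.py\n'
  printf '2\t:\t0\t1.2\t0\t0\t1\t0\tpython3 worker.py\n'
  printf '3\t:\t0\t0.9\t0\t0\t0\t0\tpython3 worker.py\n'
} > demo_joblog.tsv

# Sequence numbers of jobs whose Exitval (column 7) is non-zero
failed=$(awk -F'\t' 'NR > 1 && $7 != 0 { print $1 }' demo_joblog.tsv)
echo "failed jobs: ${failed:-none}"
```

Pointing the same `awk` at `logs/${RUN_TAG}_joblog.tsv` gives you the jobs worth re-running.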
13. Validation Sequence
- Dry-run with placeholders
- Tiny 8-job test
- Verify logs and per-job files
- Scale to 100 (or more)
14. Common Pitfalls & Fixes
| Mistake | Symptom | Fix |
|---|---|---|
| Forgetting thread caps | High CPU usage, slowdown | Export the five `*_NUM_THREADS=1` vars |
| All workers writing to same file | Interleaved/corrupted output | Use unique filenames per `JOB_ID` |
| Hidden shared state | Race conditions, non-reproducible runs | Pass configuration explicitly and isolate outputs |
| Skipping `--dry-run` | Quoting or placeholder bugs | Preview first |
| Expecting `-j4` = perfect pinning | Confusion about scheduler behavior | Treat `-jN` as a concurrency limit |
15. When --env Is Required
For purely local runs, exported variables are usually inherited by child processes already. Using --env is still a good habit because it:
- Makes dependencies explicit
- Improves readability and maintainability
- Future-proofs the script for remote execution
16. Laptop-Specific Tip
Prevent macOS from sleeping during long foreground runs:
caffeinate -i bash your_launcher.sh
If your launcher immediately detaches the real work with nohup ... &, then caffeinate -i bash your_launcher.sh will end as soon as the launcher exits. In that case, wrap caffeinate around the long-lived parallel command itself, or run the batch in the foreground while caffeinate is active.
17. One-Command Pattern
mkdir -p logs
export RUN_TAG="run_2026"
export OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 VECLIB_MAXIMUM_THREADS=1 NUMEXPR_NUM_THREADS=1
nohup parallel --line-buffer --jobs 4 \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100) \
> "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &
echo $! > "logs/${RUN_TAG}.pid"
18. Advanced: GPU locking with sem
GNU Parallel also has a semaphore tool, `sem`, which is an alias for `parallel --semaphore`.
Use it when most of the pipeline is parallel, but one narrow step must be serialized or rate-limited.
Good uses include:
- one SQLite writer at a time
- one tar/zip append step at a time
- only two calls at a time to a rate-limited external API
- forcing narrow disk-I/O sections to happen one at a time
18.1 When sem is the right tool
sem is great when you need a mutex or a counting gate.
A different pattern is needed when the resource has an identity and the worker must know which one it got. In that case, use a real lock/allocator that records the assigned resource and cleans up reliably.
Examples where identity matters:
- GPU `0` vs GPU `1`
- a specific port number
- a specific scratch directory
sem limits how many jobs may enter a critical section, but it does not naturally tell the worker which resource ID it acquired.
So the rule is:
- if you only need a mutex or a counting gate, use `sem`
- if you need to allocate a specific resource identity, use a real lock/allocator
With `sem`, the protected resource is anonymous: workers only need to know "it's my turn," not "which one did I get."
For example, suppose 50 jobs each produce a small CSV and you want to append them all into one combined file:
for f in results/*.csv; do
sem -j1 --id append_lock "cat '$f' >> combined.csv"
done
sem --wait --id append_lock
No job needs to know anything about the lock itself — it just needs the guarantee that only one cat >> combined.csv runs at a time so lines don't interleave. The semaphore is a faceless gate: enter, do the write, leave.
Similarly, with a counting gate (-j3 instead of -j1), three workers can hit an API simultaneously but the fourth blocks until a slot opens. No worker needs to know "I'm in slot 2" — it just needs permission to go.
Contrast this with something like GPU assignment: if you have two GPUs and each job must set CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1, a bare semaphore can't help — the worker needs to know which GPU it got. That requires an allocator (e.g., a FIFO queue of GPU IDs) rather than a simple count-based gate.
18.2 Important sem behavior
Three details matter:
- `sem` defaults to background mode
- in scripts, you should set an explicit semaphore name with `--id`
- if you intentionally queue background semaphore jobs, finish with `sem --wait --id NAME`
The explicit --id matters because the default semaphore name is tied to the controlling TTY, which is convenient interactively but fragile in scripts.
When the current shell must block until the protected command finishes, add --fg.
18.3 Simple mutex example
Suppose the main computation can run in parallel, but the final append step must happen one job at a time:
parallel --jobs 4 --env RUN_TAG \
'JOB_ID={} SLOT={%} python3 -u ./compute_stage.py &&
sem --fg --id "${RUN_TAG}_append" -j1 \
"JOB_ID={} python3 -u ./append_one_result.py"' \
::: $(seq 1 100)
What is happening here:
- GNU Parallel runs four compute jobs concurrently
- each job reaches the append step when ready
- `sem --fg --id "${RUN_TAG}_append" -j1` turns the append step into a mutex
- `--fg` is important because the outer job should wait until the protected step finishes
If you omit --fg, sem will normally queue the protected command in the background and the outer GNU Parallel job may finish too early.
18.4 Counting semaphore example
Suppose only two workers at a time may call an expensive remote service:
parallel --jobs 4 \
'JOB_ID={} python3 -u ./prepare_request.py &&
sem --fg --id paid_api -j2 \
"JOB_ID={} python3 -u ./call_paid_api.py"' \
::: $(seq 1 100)
This means:
- four jobs can do the cheap local preparation concurrently
- only two jobs can be inside `call_paid_api.py` at once
18.5 Background semaphore pattern
Sometimes you do want the semaphore-wrapped commands to queue in the background. In that case, explicitly wait for them:
for f in logs/*.json; do
sem --id merge_json -j1 python3 -u ./merge_one_file.py "$f"
done
sem --wait --id merge_json
That final sem --wait is what turns “jobs were queued” into “the merge is definitely complete.”
18.6 Locking and disk I/O
The GNU Parallel manual specifically calls out semaphores as a way to reduce disk contention. If your workload is “read everything, compute, write everything,” a narrow serialized I/O section can outperform letting all jobs hammer the same disk at once.
This is especially relevant if your Python workers all read from or write to a single slow external disk.
18.7 Practical guidance
Use sem for:
- a single shared output database
- a single aggregate output file
- a scarce external API
- a small shared scratch step
Do not use sem alone if you still need:
- a specific GPU number
- a specific port number
- a specific scratch directory
- cleanup tied to that specific resource identity
For that use case, a real lock/allocator is stronger.
18.8 GPU lock allocators for 1, 2, or K jobs per GPU
When workers must learn a specific GPU identity, and you want to allow a fixed number of concurrent jobs per GPU, use a real lock allocator rather than a bare semaphore.
The core idea is simple:
- each GPU has one or more lockable slots
- each running job must atomically claim one slot
- the claimed slot tells the worker which GPU it got
- the slot is released when the worker exits
That gives a policy like so:
- `JOBS_PER_GPU=1`: one job per GPU
- `JOBS_PER_GPU=2`: two jobs per GPU
- `JOBS_PER_GPU=K`: K jobs per GPU
This is stronger than a plain counting gate because the worker learns a named resource identity such as GPU_ID=0.
18.9 Why a real allocator is better than fixed slot arithmetic
A fixed mapping like “slot 1 and 2 go to GPU 0, slot 3 and 4 go to GPU 1” can work in simple cases, but a real allocator is usually better when:
- job runtimes vary
- GPU IDs are non-contiguous
- you want to reserve some GPUs and exclude others
- workers must clean up explicitly on failure
- you want inspectable lock state on disk
Using an explicit list like GPU_IDS="0 2 5 7" also keeps the scheduling policy obvious.
18.10 Mental model: one token = one allowed seat
Suppose:
export GPU_IDS="0 1 2 3"
export JOBS_PER_GPU=2
Then the allowed seats are conceptually:
- GPU 0 → slot 1, slot 2
- GPU 1 → slot 1, slot 2
- GPU 2 → slot 1, slot 2
- GPU 3 → slot 1, slot 2
So the total concurrent GPU-backed workers is:
4 GPUs × 2 jobs per GPU = 8 total GPU slots
A job may run only after it claims one of those eight slots.
18.11 Launcher pattern
#!/usr/bin/env bash
set -euo pipefail
mkdir -p logs
export RUN_TAG="gpu_batch"
export GPU_IDS="0 1 2 3" # explicit GPU pool
export JOBS_PER_GPU=2 # 1, 2, or any other integer K
export WORKER="./worker.sh" # generic executable entry point
export LOCK_DIR="${TMPDIR:-/tmp}/${RUN_TAG}_gpu_locks"
mkdir -p "$LOCK_DIR"
# Only clear old locks if you are certain no earlier batch is still using them.
rm -rf "$LOCK_DIR"/*
set -- $GPU_IDS
GPU_COUNT=$#
TOTAL_GPU_SLOTS=$(( GPU_COUNT * JOBS_PER_GPU ))
# Preview first
parallel --dry-run --jobs "${TOTAL_GPU_SLOTS}" \
--env RUN_TAG,GPU_IDS,JOBS_PER_GPU,WORKER,LOCK_DIR \
'JOB_ID={} /bin/bash ./gpu_lock_wrapper.sh' \
::: $(seq 1 100)
# Real run
nohup parallel --line-buffer --jobs "${TOTAL_GPU_SLOTS}" \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,GPU_IDS,JOBS_PER_GPU,WORKER,LOCK_DIR \
'JOB_ID={} /bin/bash ./gpu_lock_wrapper.sh' \
::: $(seq 1 100) \
> "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &
echo $! > "logs/${RUN_TAG}.pid"
A good default is to set:
--jobs = number_of_GPUs × JOBS_PER_GPU
That way, the outer GNU Parallel concurrency matches the number of GPU seats you actually intend to allow.
If --jobs is larger than the number of available GPU slots, the pattern is still correct, but extra wrappers will simply wait in the lock-acquisition loop.
18.12 GPU lock wrapper
#!/usr/bin/env bash
# File: gpu_lock_wrapper.sh
set -euo pipefail
job_id="${JOB_ID:?JOB_ID is required}"
worker="${WORKER:?WORKER is required}"
GPU_ID=""
GPU_SLOT=""
LOCK_PATH=""
TAG_PATH=""
cleanup() {
if [ -n "${LOCK_PATH}" ]; then
rm -rf "${LOCK_PATH}" 2>/dev/null || true
fi
if [ -n "${TAG_PATH}" ]; then
rm -f "${TAG_PATH}" 2>/dev/null || true
fi
}
trap cleanup EXIT INT TERM HUP
while [ -z "${GPU_ID}" ]; do
for gpu in $GPU_IDS; do
for slot in $(seq 1 "${JOBS_PER_GPU}"); do
candidate="${LOCK_DIR}/gpu${gpu}_slot${slot}.lock"
# mkdir is used because directory creation is atomic.
if mkdir "${candidate}" 2>/dev/null; then
GPU_ID="${gpu}"
GPU_SLOT="${slot}"
LOCK_PATH="${candidate}"
TAG_PATH="${LOCK_DIR}/job_${job_id}_gpu${GPU_ID}_slot${GPU_SLOT}.tag"
printf 'gpu=%s\nslot=%s\n' "${GPU_ID}" "${GPU_SLOT}" > "${TAG_PATH}"
break 2
fi
done
done
if [ -z "${GPU_ID}" ]; then
sleep 1
fi
done
echo "start job_id=${job_id} gpu=${GPU_ID} slot=${GPU_SLOT}" >&2
# The worker should read JOB_ID, GPU_ID, and GPU_SLOT from the environment.
# CUDA_VISIBLE_DEVICES is shown here because it is a common GPU selector.
CUDA_VISIBLE_DEVICES="${GPU_ID}" \
GPU_ID="${GPU_ID}" \
GPU_SLOT="${GPU_SLOT}" \
JOB_ID="${job_id}" \
"${worker}"
echo "done job_id=${job_id} gpu=${GPU_ID} slot=${GPU_SLOT}" >&2
This pattern is task agnostic:
- the worker can be Python, R, Bash, or anything else executable
- the GPU pool can be contiguous or non-contiguous
- changing `JOBS_PER_GPU` changes the policy from 1 to 2 to K jobs per GPU without rewriting the allocator
If your runtime uses a different GPU-selection variable, export that instead of CUDA_VISIBLE_DEVICES.
18.13 Why mkdir is used for the lock
A plain semaphore limits how many jobs may proceed, but not which GPU each job received.
The lock-directory approach solves both problems at once:
- `gpu0_slot1.lock` means one seat on GPU `0` is occupied
- `gpu0_slot2.lock` means a second seat on GPU `0` is occupied
- if both exist, GPU `0` is full when `JOBS_PER_GPU=2`
The important property is that creating the directory is atomic, so only one contender can successfully claim the same slot.
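The guarantee is easy to verify with plain shell, no GPUs involved: the first `mkdir` on a path succeeds, and every later attempt fails until the directory is removed.

```shell
lock="demo_gpu0_slot1.lock"
rm -rf "$lock"

# First contender claims the slot; the second finds it occupied
if mkdir "$lock" 2>/dev/null; then first=claimed; else first=busy; fi
if mkdir "$lock" 2>/dev/null; then second=claimed; else second=busy; fi
echo "first=$first second=$second"

# Releasing the lock (rmdir) frees the seat for the next contender
rmdir "$lock"
if mkdir "$lock" 2>/dev/null; then third=claimed; else third=busy; fi
rm -rf "$lock"
echo "after release: third=$third"
</imports>

The two contenders here run sequentially, but the same outcome holds under true concurrency because the filesystem serializes the `mkdir` calls.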
18.14 Monitoring the lock state
find "${LOCK_DIR}" -maxdepth 1 -type d -name 'gpu*_slot*.lock' | sort
grep -H . "${LOCK_DIR}"/job_*.tag 2>/dev/null || true
tail -f "logs/${RUN_TAG}.out"
column -t < "logs/${RUN_TAG}_joblog.tsv" | less -S
These show:
- which GPU slots are currently occupied
- which jobs currently claim which GPUs
- the batch-wide stdout/stderr stream
- the GNU Parallel job log
18.15 Failure handling and stale locks
The trap handles normal completion and common termination signals, so locks are released when a worker exits in ordinary ways.
Two cases still require manual cleanup:
- `kill -9`
- machine crash or sudden power loss
In those cases, the shell cannot run cleanup code, so stale lock directories may remain behind.
A safe cleanup sequence is:
- verify the old batch is truly gone
- inspect the lock directory
- remove stale lock directories only after you are sure nothing still owns them
Using a dedicated LOCK_DIR per batch, or at least per host, makes this easier to reason about.
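One practical extension, not part of the wrapper above, is to record the owning PID inside each lock so cleanup can test liveness with `kill -0` before deleting anything. A sketch of that hypothetical convention (the `pid` file name and `demo_locks` path are invented here):

```shell
LOCKS="demo_locks"
mkdir -p "$LOCKS/gpu0_slot1.lock" "$LOCKS/gpu1_slot1.lock"
echo $$     > "$LOCKS/gpu0_slot1.lock/pid"   # this shell: definitely alive
echo 999999 > "$LOCKS/gpu1_slot1.lock/pid"   # almost certainly no such process

for lock in "$LOCKS"/gpu*_slot*.lock; do
    owner=$(cat "$lock/pid")
    if kill -0 "$owner" 2>/dev/null; then
        echo "live:  $lock (pid $owner)"     # still owned; leave it alone
    else
        echo "stale: $lock (pid $owner)"     # owner gone; reclaim the seat
        rm -rf "$lock"
    fi
done
```

In real use, confirm first that the batch's `parallel` process itself is gone; `kill -0` only tells you whether a PID exists, and PIDs can in principle be recycled.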
18.16 Practical guidance for GPU locks
Use this pattern when:
- jobs must learn a specific GPU ID
- you want 1, 2, or K jobs per GPU
- job durations are uneven
- you want explicit, inspectable scheduler state
A few rules keep it reliable:
- keep `GPU_IDS` explicit
- keep `--jobs` aligned with the total number of GPU slots
- treat `JOBS_PER_GPU` as a policy limit, not a performance guarantee
- increase `JOBS_PER_GPU` only when memory use and contention remain acceptable
- keep the lock directory on a filesystem where `mkdir` is atomic and visible to all contenders in the same pool
- on multi-node runs, use a separate lock namespace per host unless the GPUs are intentionally managed as one shared pool
In short:
- use `sem` for anonymous gates
- use a GPU lock allocator when workers must acquire a named GPU and you want a clear 1, 2, or K jobs-per-GPU policy
19. Advanced: Multi-node jobs over SSH
GNU Parallel can distribute jobs across machines directly with --sshlogin / -S or --sshloginfile / --slf, instead of making you hand-write nested ssh ... loops.
That is usually cleaner, easier to log, and easier to scale.
19.1 Preconditions
Before doing multi-node jobs, make sure:
- you can SSH from the launch machine to each worker without interactive prompts
- GNU Parallel is installed on the remote machines
- `python3` exists on each remote machine, or you use an absolute interpreter path
- you know whether the project directory is shared across machines
19.2 First sanity checks
Start with these:
parallel -S user@node1,user@node2 --nonall hostname
parallel -S user@node1,user@node2 --nonall 'which python3'
parallel -S user@node1,user@node2 --nonall 'parallel --version | head -n1'
If these do not work cleanly, do not launch the real batch yet.
19.3 Shared-filesystem setup
The simplest multi-node case is when every machine sees the same project path.
Example node file:
2/:
1/user@node1
1/user@node2
Interpretation:
- `:` means the local machine
- `2/:` gives the local machine two slots
- `1/user@node1` gives `node1` one slot
- `1/user@node2` gives `node2` one slot
So the total concurrency is 4.
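A quick sanity check on any node file is to sum the numeric slot prefixes. This sketch assumes every line uses the explicit `N/host` form shown above (lines without a prefix default to one slot per CPU thread, which this naive sum would miscount):

```shell
cat > demo_nodes.txt <<'EOF'
2/:
1/user@node1
1/user@node2
EOF

# Sum the part before the first "/" on each line
total=$(awk -F/ '{ sum += $1 } END { print sum }' demo_nodes.txt)
echo "total slots: $total"
```

The result should match the concurrency you expect across the pool before you launch the real batch.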
Now launch:
export RUN_TAG="cluster100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
parallel --slf nodes.txt \
--workdir . \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100)
Why this is the clean pattern:
- GNU Parallel distributes jobs itself
- the job log stays centralized on the launch machine
- `--workdir .` tells GNU Parallel to use the caller’s current working directory on the remote hosts, with special handling when the directory is under your home directory
- `--env` explicitly copies the exported variables, which is especially useful for remote execution
19.4 Why --sshloginfile is better than hand-written ssh loops
Using --slf nodes.txt or -S host1,host2 is often better because:
- GNU Parallel knows which host ran each job
- slot counts are managed in one place
- file transfer features integrate cleanly
- the command itself stays focused on the worker
The more nodes you add, the more valuable that becomes.
19.5 If the filesystem is not shared
If remote machines do not see the same project directory, use GNU Parallel’s transfer features.
The most important ones are:
- `--basefile file`: copy a common file to each remote host before the first job
- `--transferfile file`: copy a per-job input file
- `--return file`: pull a result file back to the launch machine
- `--cleanup`: remove transferred files after the job
- `--trc file`: shorthand for transfer, return, cleanup in file-based workflows
Example pattern for file-based jobs:
find inputs -name '*.json' | parallel --slf nodes.txt \
--workdir ... \
--basefile worker.py \
--transferfile {} \
--return '{/.}.out' \
--cleanup \
'python3 -u ./worker.py {} > {/.}.out'
Interpretation:
- `worker.py` is copied to each node before the first job
- each input file is copied to the node that will process it
- one output file per input is copied back
- temporary transferred files are cleaned up
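The `{/.}` used in `--return '{/.}.out'` is GNU Parallel's "basename, drop the last extension" replacement string. In plain shell terms it behaves roughly like this (the path `inputs/report.json` is an invented example):

```shell
f="inputs/report.json"

base=${f##*/}      # like {/}  : strip leading directories
stem=${base%.*}    # like {/.} : also strip the last extension
out="${stem}.out"  # the per-input result file pulled back by --return

echo "$f -> $out"
```

Knowing this expansion makes it easy to predict exactly which filenames will land back on the launch machine.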
If your job inputs are integers rather than files, a common pattern is:
- ship the worker once with `--basefile`
- keep inputs as replacement strings like `JOB_ID={}`
- return named outputs with `--return`
19.6 Environment handling on remote nodes
For remote jobs, prefer a short explicit env list:
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS
If you eventually need shell functions, aliases, or non-exported variables on remote nodes, GNU Parallel also supports env_parallel. For simple worker-style launchers, --env is usually enough and easier to reason about.
19.7 Heterogeneous cluster advice
If your machines are not identical:
- use an absolute Python path if needed
- put slot counts in `nodes.txt`
- give slower machines fewer slots
- keep thread caps at `1` unless you intentionally want hybrid parallelism
For example:
2/user@fast-node
1/user@slow-node
1/:
This is usually better than pretending all nodes should carry the same load.
19.8 Remote wrapper pattern
If the remote environment needs activation, wrap it explicitly in a shell that supports your activation logic:
parallel --slf nodes.txt \
--workdir . \
--env RUN_TAG \
'bash -lc "source ~/.bashrc && conda activate myenv && JOB_ID={} SLOT={%} python3 -u ./worker.py"' \
::: $(seq 1 100)
That is safer than assuming the default remote shell understands source or your environment manager.
19.9 A clean four-slot, two-node example
If you want a compact concrete template for 100 jobs over two remote machines plus local capacity:
# nodes.txt
2/:
1/user@node1
1/user@node2
mkdir -p logs
export RUN_TAG="cluster100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
parallel --slf nodes.txt \
--workdir . \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100)
This gives you:
- `100` total jobs
- `4` total concurrent slots across the node pool
- one centralized job log
- explicit env propagation
- the same worker contract as the single-machine version
20. Final recommended command
If you want one command that best matches the single-machine macOS use case, use this:
mkdir -p logs
export RUN_TAG="mac100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
nohup parallel --line-buffer --jobs 4 \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100) \
> "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &
echo $! > "logs/${RUN_TAG}.pid"
This gives you:
- `100` total jobs
- `4` concurrent Python workers
- shared batch context via `RUN_TAG`
- per-job context via `JOB_ID` and `SLOT`
- a macOS-safe default for CPU-heavy Python work
21. Bottom line
For clean, high-performance GNU Parallel workflows for Python:
- export shared run-level variables once
- cap math-library threads to `1`
- use `parallel --jobs N --joblog ... --delay ...`
- assign per-job env vars inline with `JOB_ID={}` and `SLOT={%}`
- keep the Python worker as stateless as practical
- write per-job outputs and keep a global job log
- use `sem --id ... --fg` only around narrow critical sections
- use a real lock/allocator when a worker must learn a specific resource identity
- for multi-node jobs, prefer `--slf` or `-S` over hand-written nested `ssh`
- when the filesystem is not shared, use `--basefile`, `--transferfile`, `--return`, and `--cleanup`