GNU Parallel: Running Many Jobs Across Multiple GPUs, CPU Cores, or Compute Nodes to Accelerate AI Workflows
GNU Parallel tutorial -- How to build an HPC cluster for AI using your local machines.
This tutorial shows how to use GNU Parallel to run, say, 100 Python or R jobs with at most N (e.g., 4) running at the same time, while cleanly passing environment variables to each .py (or .R) worker script. It also shows how to manage jobs across a multi-GPU environment where each GPU can only handle a specific number of jobs, and how to send jobs across machines via SSH under unified GNU Parallel management. Learning these principles will help you accelerate your AI workflows by more fully leveraging the compute available to you.
1. Recommended GNU Parallel Practices
Strong patterns when using GNU Parallel for batch jobs include:
- `nohup parallel ... &` for long-running background jobs
- `--joblog` for durable, machine-readable tracking of every job
- `--delay` to stagger job starts and avoid thundering-herd problems
- `seq` to generate clean numeric job sequences
- Exported environment variables passed via `--env` when workers need shared context
These techniques make workflows easier to monitor, restart, and debug.
2. Thread Control for Python
GNU Parallel’s -j4 limits the number of worker processes. It does not stop math libraries (NumPy, OpenBLAS, MKL, Accelerate, NumExpr, etc.) from spawning extra threads inside each process.
Without explicit caps, four Python jobs can easily turn into 16–32 runnable threads, causing oversubscription, sluggish performance, and thermal throttling.
Set these limits before launching (this is often the single most important macOS-specific detail):
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
3. Using Environment Variables for Clean Job Context
Keep each Python worker as stateless as practical. Pass configuration through environment variables:
- Shared values (identical for every job): export once + `--env`
- Per-job values (unique): assign inline right before calling the script
This reduces hidden shared state and makes job context explicit. Typical variables:
- `RUN_TAG` — shared batch identifier
- `JOB_ID` — unique per job
- `SLOT` — GNU Parallel slot number (1–4)
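The shared-vs-per-job split can be seen with plain shell alone (no GNU Parallel needed; `batchA` and the job number `7` are arbitrary illustration values):

```shell
# Shared value: exported once, inherited by every child process
export RUN_TAG="batchA"

# Per-job value: assigned inline, visible only to that single command
line=$(JOB_ID=7 sh -c 'echo "run_tag=$RUN_TAG job_id=$JOB_ID"')
echo "$line"

# The inline assignment does not leak back into the launching shell
echo "leaked=${JOB_ID:-unset}"
```

This is exactly the mechanism that `JOB_ID={} python3 worker.py` relies on later: GNU Parallel expands `{}`, and the shell scopes `JOB_ID` to that one worker invocation.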
4. Install GNU Parallel
brew install parallel
parallel --citation # one-time acknowledgement
parallel --number-of-cores
5. Mental Model: 100 Jobs, Maximum 4 Concurrent
seq 1 100 | parallel -j4 'python3 worker.py'
- 100 total jobs
- At most 4 running at once
- As soon as one finishes, the next queued job starts automatically
-j4 is a concurrency limit, not hard CPU pinning. macOS decides the actual core placement.
6. Essential Placeholders
| Placeholder | Meaning | Typical use |
|---|---|---|
| `{}` | The input item | `JOB_ID` |
| `{#}` | Job sequence number | overall counter |
| `{%}` | Current slot number (1 to N) | `SLOT` |
Quick demo:
seq 1 6 | parallel -j4 'echo input={} seq={#} slot={%}'
{%} is a GNU Parallel slot number, not a CPU core ID.
7. Minimal Stateless Python Worker
#!/usr/bin/env python3
import os
import time
from pathlib import Path
job_id = int(os.environ["JOB_ID"])
slot = int(os.environ["SLOT"])
run_tag = os.environ["RUN_TAG"]
Path("logs").mkdir(exist_ok=True)
print(f"start job_id={job_id} slot={slot} run_tag={run_tag}", flush=True)
# Replace with your actual work here
time.sleep(1)
with open(f"logs/{run_tag}_job_{job_id:03d}.txt", "w") as f:
    f.write(f"job_id={job_id}\nslot={slot}\nrun_tag={run_tag}\n")
print(f"done job_id={job_id} slot={slot}", flush=True)
Each worker gets its own JOB_ID and writes to its own output file, which avoids output collisions in the common case.
8. Reusable Launcher Script
Save a script like this (tailored to your task) as a `.sh` file and invoke it via the terminal (e.g., `sh launch_script.sh`).
#!/usr/bin/env bash
set -euo pipefail
mkdir -p logs
export RUN_TAG="run_2026"
# Thread caps
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
# Preview first (highly recommended)
parallel --dry-run --line-buffer --jobs 4 \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100)
# Real run in background
nohup parallel --line-buffer --jobs 4 \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100) \
> "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &
echo $! > "logs/${RUN_TAG}.pid"
Two notes:
- `set -euo pipefail` helps catch setup errors in the launcher itself, but it does not make the script fail fast on errors that happen later inside the detached `nohup ... &` job.
- Writing the PID to `logs/${RUN_TAG}.pid` makes it easier to inspect or stop the batch later.
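A sketch of the matching stop procedure, assuming a PID file like the one written above (the helper name `stop_batch` is invented for this example; GNU Parallel's exact behavior on SIGTERM varies by version, so check `man parallel` for yours):

```shell
# Read the saved PID and send SIGTERM if the process is still alive.
stop_batch() {
    pidfile="$1"
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid" && echo "sent TERM to $pid"
    else
        echo "process $pid is not running"
    fi
}

# Demo against a throwaway background process instead of a real batch:
sleep 60 &
echo $! > demo.pid
stop_batch demo.pid
```

In a real batch you would call it as `stop_batch "logs/${RUN_TAG}.pid"`, then confirm via `ps` that the workers have actually wound down.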
9. Explaining Each Flag
- `set -euo pipefail` — catches many launcher-script mistakes early
- `mkdir -p logs` — guarantees the output directory exists
- Thread exports — reduce oversubscription risk
- `--dry-run` — shows exactly what will run before committing
- `--line-buffer` — makes live output easier to read; complete lines from different jobs can still appear interleaved
- `--jobs 4` — concurrency limit
- `--joblog` — TSV record of every job (start time, runtime, exit code, command)
- `--delay 0.2` — gentle stagger
- `--env ...` — makes dependencies explicit and is especially useful for remote jobs
- `JOB_ID={} SLOT={%}` — per-job context assignment
- `nohup ... &` — lets the batch survive terminal close
10. Shared vs Per-Job Environment Variables
# Shared (export + optional --env for explicitness)
export RUN_TAG=batchA
parallel --env RUN_TAG 'python3 worker.py' ::: $(seq 1 10)
# Per-job (inline)
parallel 'JOB_ID={} python3 worker.py' ::: $(seq 1 10)
# Combined pattern
export RUN_TAG=batchA
parallel --env RUN_TAG 'JOB_ID={} SLOT={%} python3 worker.py' ::: $(seq 1 10)
11. Good Logging Habits
- Each job writes its own output file when practical
- Keep one batch-wide `.out` and `.err` file
- Always keep the `--joblog` TSV
- Save the launcher PID if you detach the batch
12. Monitoring Commands
tail -f logs/${RUN_TAG}.out
tail -n +2 logs/${RUN_TAG}_joblog.tsv | wc -l # jobs logged so far
column -t < logs/${RUN_TAG}_joblog.tsv | less -S
ps aux | grep '[p]ython3 -u ./worker.py'
cat logs/${RUN_TAG}.pid
If you need the final success/failure status of every job, treat --joblog as the source of truth.
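Current GNU Parallel joblogs are tab-separated with a header row (`Seq Host Starttime JobRuntime Send Receive Exitval Signal Command`), so the exit status is the `Exitval` column, column 7. A sketch that lists failing sequence numbers, run here against a synthetic three-row joblog so it works anywhere (verify the column position against your own header row before relying on it):

```shell
# Synthetic joblog for illustration; a real one comes from --joblog
{
  printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n'
  printf '1\t:\t0\t1.0\t0\t0\t0\t0\tpython3 worker.py\n'
  printf '2\t:\t0\t1.2\t0\t0\t1\t0\tpython3 worker.py\n'
  printf '3\t:\t0\t0.9\t0\t0\t0\t0\tpython3 worker.py\n'
} > demo_joblog.tsv

# Sequence numbers of jobs whose Exitval (column 7) is non-zero
failed=$(awk -F'\t' 'NR > 1 && $7 != 0 { print $1 }' demo_joblog.tsv)
echo "failed jobs: ${failed:-none}"
```

Pointing the same `awk` at `logs/${RUN_TAG}_joblog.tsv` gives you the jobs worth re-running.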
13. Validation Sequence
- Dry-run with placeholders
- Tiny 8-job test
- Verify logs and per-job files
- Scale to 100 (or more)
14. Common Pitfalls & Fixes
| Mistake | Symptom | Fix |
|---|---|---|
| Forgetting thread caps | High CPU usage, slowdown | Export the five `*_NUM_THREADS=1` vars |
| All workers writing to same file | Interleaved/corrupted output | Use unique filenames per `JOB_ID` |
| Hidden shared state | Race conditions, non-reproducible runs | Pass configuration explicitly and isolate outputs |
| Skipping `--dry-run` | Quoting or placeholder bugs | Preview first |
| Expecting `-j4` = perfect pinning | Confusion about scheduler behavior | Treat `-jN` as a concurrency limit |
15. When --env Is Required
For purely local runs, exported variables are usually inherited by child processes already. Using --env is still a good habit because it:
- Makes dependencies explicit
- Improves readability and maintainability
- Future-proofs the script for remote execution
16. Laptop-Specific Tip
Prevent macOS from sleeping during long foreground runs:
caffeinate -i bash your_launcher.sh
If your launcher immediately detaches the real work with nohup ... &, then caffeinate -i bash your_launcher.sh will end as soon as the launcher exits. In that case, wrap caffeinate around the long-lived parallel command itself, or run the batch in the foreground while caffeinate is active.
17. One-Command Pattern
mkdir -p logs
export RUN_TAG="run_2026"
export OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 VECLIB_MAXIMUM_THREADS=1 NUMEXPR_NUM_THREADS=1
nohup parallel --line-buffer --jobs 4 \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100) \
> "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &
echo $! > "logs/${RUN_TAG}.pid"
18. Advanced: GPU locking with sem
GNU Parallel also has a semaphore tool, `sem`, which is an alias for `parallel --semaphore`.
Use it when most of the pipeline is parallel, but one narrow step must be serialized or rate-limited.
Good uses include:
- one SQLite writer at a time
- one tar/zip append step at a time
- only two calls at a time to a rate-limited external API
- forcing narrow disk-I/O sections to happen one at a time
18.1 When sem is the right tool
sem is great when you need a mutex or a counting gate.
A different pattern is needed when the resource has an identity and the worker must know which one it got. In that case, use a real lock/allocator that records the assigned resource and cleans up reliably.
Examples where identity matters:
- GPU `0` vs GPU `1`
- a specific port number
- a specific scratch directory
sem limits how many jobs may enter a critical section, but it does not naturally tell the worker which resource ID it acquired.
So the rule is:
- if you only need a mutex or a counting gate, use `sem`
- if you need to allocate a specific resource identity, use a real lock/allocator
With `sem`, the protected resource is anonymous: workers only need to know "it's my turn," not "which one did I get."
For example, suppose 50 jobs each produce a small CSV and you want to append them all into one combined file:
for f in results/*.csv; do
sem -j1 --id append_lock "cat '$f' >> combined.csv"
done
sem --wait --id append_lock
No job needs to know anything about the lock itself — it just needs the guarantee that only one cat >> combined.csv runs at a time so lines don't interleave. The semaphore is a faceless gate: enter, do the write, leave.
Similarly, with a counting gate (-j3 instead of -j1), three workers can hit an API simultaneously but the fourth blocks until a slot opens. No worker needs to know "I'm in slot 2" — it just needs permission to go.
Contrast this with something like GPU assignment: if you have two GPUs and each job must set CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1, a bare semaphore can't help — the worker needs to know which GPU it got. That requires an allocator (e.g., a FIFO queue of GPU IDs) rather than a simple count-based gate.
18.2 Important sem behavior
Three details matter:
- `sem` defaults to background mode
- in scripts, you should set an explicit semaphore name with `--id`
- if you intentionally queue background semaphore jobs, finish with `sem --wait --id NAME`
The explicit --id matters because the default semaphore name is tied to the controlling TTY, which is convenient interactively but fragile in scripts.
When the current shell must block until the protected command finishes, add --fg.
18.3 Simple mutex example
Suppose the main computation can run in parallel, but the final append step must happen one job at a time:
parallel --jobs 4 --env RUN_TAG \
'JOB_ID={} SLOT={%} python3 -u ./compute_stage.py &&
sem --fg --id "${RUN_TAG}_append" -j1 \
"JOB_ID={} python3 -u ./append_one_result.py"' \
::: $(seq 1 100)
What is happening here:
- GNU Parallel runs four compute jobs concurrently
- each job reaches the append step when ready
- `sem --fg --id "${RUN_TAG}_append" -j1` turns the append step into a mutex
- `--fg` is important because the outer job should wait until the protected step finishes
If you omit --fg, sem will normally queue the protected command in the background and the outer GNU Parallel job may finish too early.
18.4 Counting semaphore example
Suppose only two workers at a time may call an expensive remote service:
parallel --jobs 4 \
'JOB_ID={} python3 -u ./prepare_request.py &&
sem --fg --id paid_api -j2 \
"JOB_ID={} python3 -u ./call_paid_api.py"' \
::: $(seq 1 100)
This means:
- four jobs can do the cheap local preparation concurrently
- only two jobs can be inside `call_paid_api.py` at once
18.5 Background semaphore pattern
Sometimes you do want the semaphore-wrapped commands to queue in the background. In that case, explicitly wait for them:
for f in logs/*.json; do
sem --id merge_json -j1 python3 -u ./merge_one_file.py "$f"
done
sem --wait --id merge_json
That final sem --wait is what turns “jobs were queued” into “the merge is definitely complete.”
18.6 Locking and disk I/O
The GNU Parallel manual specifically calls out semaphores as a way to reduce disk contention. If your workload is “read everything, compute, write everything,” a narrow serialized I/O section can outperform letting all jobs hammer the same disk at once.
This is especially relevant if your Python workers all read from or write to a single slow external disk.
18.7 Practical guidance
Use sem for:
- a single shared output database
- a single aggregate output file
- a scarce external API
- a small shared scratch step
Do not use sem alone if you still need:
- a specific GPU number
- a specific port number
- a specific scratch directory
- cleanup tied to that specific resource identity
For that use case, a real lock/allocator is stronger.
18.8 GPU lock allocators for 1, 2, or K jobs per GPU
When workers must learn a specific GPU identity, and you want to allow a fixed number of concurrent jobs per GPU, use a real lock allocator rather than a bare semaphore.
The core idea is simple:
- each GPU has one or more lockable slots
- each running job must atomically claim one slot
- the claimed slot tells the worker which GPU it got
- the slot is released when the worker exits
That gives a policy like so:
- `JOBS_PER_GPU=1`: one job per GPU
- `JOBS_PER_GPU=2`: two jobs per GPU
- `JOBS_PER_GPU=K`: K jobs per GPU
This is stronger than a plain counting gate because the worker learns a named resource identity such as GPU_ID=0.
18.9 Why a real allocator is better than fixed slot arithmetic
A fixed mapping like “slot 1 and 2 go to GPU 0, slot 3 and 4 go to GPU 1” can work in simple cases, but a real allocator is usually better when:
- job runtimes vary
- GPU IDs are non-contiguous
- you want to reserve some GPUs and exclude others
- workers must clean up explicitly on failure
- you want inspectable lock state on disk
Using an explicit list like GPU_IDS="0 2 5 7" also keeps the scheduling policy obvious.
18.10 Mental model: one token = one allowed seat
Suppose:
export GPU_IDS="0 1 2 3"
export JOBS_PER_GPU=2
Then the allowed seats are conceptually:
- GPU 0 → slot 1, slot 2
- GPU 1 → slot 1, slot 2
- GPU 2 → slot 1, slot 2
- GPU 3 → slot 1, slot 2
So the total concurrent GPU-backed workers is:
4 GPUs × 2 jobs per GPU = 8 total GPU slots
A job may run only after it claims one of those eight slots.
18.11 Launcher pattern
#!/usr/bin/env bash
set -euo pipefail
mkdir -p logs
export RUN_TAG="gpu_batch"
export GPU_IDS="0 1 2 3" # explicit GPU pool
export JOBS_PER_GPU=2 # 1, 2, or any other integer K
export WORKER="./worker.sh" # generic executable entry point
export LOCK_DIR="${TMPDIR:-/tmp}/${RUN_TAG}_gpu_locks"
mkdir -p "$LOCK_DIR"
# Only clear old locks if you are certain no earlier batch is still using them.
rm -rf "$LOCK_DIR"/*
set -- $GPU_IDS
GPU_COUNT=$#
TOTAL_GPU_SLOTS=$(( GPU_COUNT * JOBS_PER_GPU ))
# Preview first
parallel --dry-run --jobs "${TOTAL_GPU_SLOTS}" \
--env RUN_TAG,GPU_IDS,JOBS_PER_GPU,WORKER,LOCK_DIR \
'JOB_ID={} /bin/bash ./gpu_lock_wrapper.sh' \
::: $(seq 1 100)
# Real run
nohup parallel --line-buffer --jobs "${TOTAL_GPU_SLOTS}" \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,GPU_IDS,JOBS_PER_GPU,WORKER,LOCK_DIR \
'JOB_ID={} /bin/bash ./gpu_lock_wrapper.sh' \
::: $(seq 1 100) \
> "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &
echo $! > "logs/${RUN_TAG}.pid"
A good default is to set:
--jobs = number_of_GPUs × JOBS_PER_GPU
That way, the outer GNU Parallel concurrency matches the number of GPU seats you actually intend to allow.
If --jobs is larger than the number of available GPU slots, the pattern is still correct, but extra wrappers will simply wait in the lock-acquisition loop.
18.12 GPU lock wrapper
#!/usr/bin/env bash
# File: gpu_lock_wrapper.sh
set -euo pipefail
job_id="${JOB_ID:?JOB_ID is required}"
worker="${WORKER:?WORKER is required}"
GPU_ID=""
GPU_SLOT=""
LOCK_PATH=""
TAG_PATH=""
cleanup() {
if [ -n "${LOCK_PATH}" ]; then
rm -rf "${LOCK_PATH}" 2>/dev/null || true
fi
if [ -n "${TAG_PATH}" ]; then
rm -f "${TAG_PATH}" 2>/dev/null || true
fi
}
trap cleanup EXIT INT TERM HUP
while [ -z "${GPU_ID}" ]; do
for gpu in $GPU_IDS; do
for slot in $(seq 1 "${JOBS_PER_GPU}"); do
candidate="${LOCK_DIR}/gpu${gpu}_slot${slot}.lock"
# mkdir is used because directory creation is atomic.
if mkdir "${candidate}" 2>/dev/null; then
GPU_ID="${gpu}"
GPU_SLOT="${slot}"
LOCK_PATH="${candidate}"
TAG_PATH="${LOCK_DIR}/job_${job_id}_gpu${GPU_ID}_slot${GPU_SLOT}.tag"
printf 'gpu=%s\nslot=%s\n' "${GPU_ID}" "${GPU_SLOT}" > "${TAG_PATH}"
break 2
fi
done
done
if [ -z "${GPU_ID}" ]; then
sleep 1
fi
done
echo "start job_id=${job_id} gpu=${GPU_ID} slot=${GPU_SLOT}" >&2
# The worker should read JOB_ID, GPU_ID, and GPU_SLOT from the environment.
# CUDA_VISIBLE_DEVICES is shown here because it is a common GPU selector.
CUDA_VISIBLE_DEVICES="${GPU_ID}" \
GPU_ID="${GPU_ID}" \
GPU_SLOT="${GPU_SLOT}" \
JOB_ID="${job_id}" \
"${worker}"
echo "done job_id=${job_id} gpu=${GPU_ID} slot=${GPU_SLOT}" >&2
This pattern is task agnostic:
- the worker can be Python, R, Bash, or anything else executable
- the GPU pool can be contiguous or non-contiguous
- changing `JOBS_PER_GPU` changes the policy from 1 to 2 to K jobs per GPU without rewriting the allocator
If your runtime uses a different GPU-selection variable, export that instead of CUDA_VISIBLE_DEVICES.
18.13 Why mkdir is used for the lock
A plain semaphore limits how many jobs may proceed, but not which GPU each job received.
The lock-directory approach solves both problems at once:
- `gpu0_slot1.lock` means one seat on GPU `0` is occupied
- `gpu0_slot2.lock` means a second seat on GPU `0` is occupied
- if both exist, GPU `0` is full when `JOBS_PER_GPU=2`
The important property is that creating the directory is atomic, so only one contender can successfully claim the same slot.
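The guarantee is easy to verify with plain shell, no GPUs involved: the first `mkdir` on a path succeeds, and every later attempt fails until the directory is removed.

```shell
lock="demo_gpu0_slot1.lock"
rm -rf "$lock"

# First contender claims the slot; the second finds it occupied
if mkdir "$lock" 2>/dev/null; then first=claimed; else first=busy; fi
if mkdir "$lock" 2>/dev/null; then second=claimed; else second=busy; fi
echo "first=$first second=$second"

# Releasing the lock (rmdir) frees the seat for the next contender
rmdir "$lock"
if mkdir "$lock" 2>/dev/null; then third=claimed; else third=busy; fi
rm -rf "$lock"
echo "after release: third=$third"
</imports>

The two contenders here run sequentially, but the same outcome holds under true concurrency because the filesystem serializes the `mkdir` calls.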
18.14 Monitoring the lock state
find "${LOCK_DIR}" -maxdepth 1 -type d -name 'gpu*_slot*.lock' | sort
grep -H . "${LOCK_DIR}"/job_*.tag 2>/dev/null || true
tail -f "logs/${RUN_TAG}.out"
column -t < "logs/${RUN_TAG}_joblog.tsv" | less -S
These show:
- which GPU slots are currently occupied
- which jobs currently claim which GPUs
- the batch-wide stdout/stderr stream
- the GNU Parallel job log
18.15 Failure handling and stale locks
The trap handles normal completion and common termination signals, so locks are released when a worker exits in ordinary ways.
Two cases still require manual cleanup:
- `kill -9`
- machine crash or sudden power loss
In those cases, the shell cannot run cleanup code, so stale lock directories may remain behind.
A safe cleanup sequence is:
- verify the old batch is truly gone
- inspect the lock directory
- remove stale lock directories only after you are sure nothing still owns them
Using a dedicated LOCK_DIR per batch, or at least per host, makes this easier to reason about.
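One practical extension, not part of the wrapper above, is to record the owning PID inside each lock so cleanup can test liveness with `kill -0` before deleting anything. A sketch of that hypothetical convention (the `pid` file name and `demo_locks` path are invented here):

```shell
LOCKS="demo_locks"
mkdir -p "$LOCKS/gpu0_slot1.lock" "$LOCKS/gpu1_slot1.lock"
echo $$     > "$LOCKS/gpu0_slot1.lock/pid"   # this shell: definitely alive
echo 999999 > "$LOCKS/gpu1_slot1.lock/pid"   # almost certainly no such process

for lock in "$LOCKS"/gpu*_slot*.lock; do
    owner=$(cat "$lock/pid")
    if kill -0 "$owner" 2>/dev/null; then
        echo "live:  $lock (pid $owner)"     # still owned; leave it alone
    else
        echo "stale: $lock (pid $owner)"     # owner gone; reclaim the seat
        rm -rf "$lock"
    fi
done
```

In real use, confirm first that the batch's `parallel` process itself is gone; `kill -0` only tells you whether a PID exists, and PIDs can in principle be recycled.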
18.16 Practical guidance for GPU locks
Use this pattern when:
- jobs must learn a specific GPU ID
- you want 1, 2, or K jobs per GPU
- job durations are uneven
- you want explicit, inspectable scheduler state
A few rules keep it reliable:
- keep `GPU_IDS` explicit
- keep `--jobs` aligned with the total number of GPU slots
- treat `JOBS_PER_GPU` as a policy limit, not a performance guarantee
- increase `JOBS_PER_GPU` only when memory use and contention remain acceptable
- keep the lock directory on a filesystem where `mkdir` is atomic and visible to all contenders in the same pool
- on multi-node runs, use a separate lock namespace per host unless the GPUs are intentionally managed as one shared pool
In short:
- use `sem` for anonymous gates
- use a GPU lock allocator when workers must acquire a named GPU and you want a clear 1, 2, or K jobs-per-GPU policy
19. Advanced: Multi-node jobs over SSH
GNU Parallel can distribute jobs across machines directly with --sshlogin / -S or --sshloginfile / --slf, instead of making you hand-write nested ssh ... loops.
That is usually cleaner, easier to log, and easier to scale.
19.1 Preconditions
Before doing multi-node jobs, make sure:
- you can SSH from the launch machine to each worker without interactive prompts
- GNU Parallel is installed on the remote machines
- `python3` exists on each remote machine, or you use an absolute interpreter path
- you know whether the project directory is shared across machines
19.2 First sanity checks
Start with these:
parallel -S user@node1,user@node2 --nonall hostname
parallel -S user@node1,user@node2 --nonall 'which python3'
parallel -S user@node1,user@node2 --nonall 'parallel --version | head -n1'
If these do not work cleanly, do not launch the real batch yet.
19.3 Shared-filesystem setup
The simplest multi-node case is when every machine sees the same project path.
Example node file:
2/:
1/user@node1
1/user@node2
Interpretation:
- `:` means the local machine
- `2/:` gives the local machine two slots
- `1/user@node1` gives `node1` one slot
- `1/user@node2` gives `node2` one slot
So the total concurrency is 4.
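A quick sanity check on any node file is to sum the numeric slot prefixes. This sketch assumes every line uses the explicit `N/host` form shown above (lines without a prefix default to one slot per CPU thread, which this naive sum would miscount):

```shell
cat > demo_nodes.txt <<'EOF'
2/:
1/user@node1
1/user@node2
EOF

# Sum the part before the first "/" on each line
total=$(awk -F/ '{ sum += $1 } END { print sum }' demo_nodes.txt)
echo "total slots: $total"
```

The result should match the concurrency you expect across the pool before you launch the real batch.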
Now launch:
export RUN_TAG="cluster100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
parallel --slf nodes.txt \
--workdir . \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100)
Why this is the clean pattern:
- GNU Parallel distributes jobs itself
- the job log stays centralized on the launch machine
- `--workdir .` tells GNU Parallel to use the caller’s current working directory on the remote hosts, with special handling when the directory is under your home directory
- `--env` explicitly copies the exported variables, which is especially useful for remote execution
19.4 Why --sshloginfile is better than hand-written ssh loops
Using --slf nodes.txt or -S host1,host2 is often better because:
- GNU Parallel knows which host ran each job
- slot counts are managed in one place
- file transfer features integrate cleanly
- the command itself stays focused on the worker
The more nodes you add, the more valuable that becomes.
19.5 If the filesystem is not shared
If remote machines do not see the same project directory, use GNU Parallel’s transfer features.
The most important ones are:
- `--basefile file`: copy a common file to each remote host before the first job
- `--transferfile file`: copy a per-job input file
- `--return file`: pull a result file back to the launch machine
- `--cleanup`: remove transferred files after the job
- `--trc file`: shorthand for transfer, return, cleanup in file-based workflows
Example pattern for file-based jobs:
find inputs -name '*.json' | parallel --slf nodes.txt \
--workdir ... \
--basefile worker.py \
--transferfile {} \
--return '{/.}.out' \
--cleanup \
'python3 -u ./worker.py {} > {/.}.out'
Interpretation:
- `worker.py` is copied to each node before the first job
- each input file is copied to the node that will process it
- one output file per input is copied back
- temporary transferred files are cleaned up
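The `{/.}` used in `--return '{/.}.out'` is GNU Parallel's "basename, drop the last extension" replacement string. In plain shell terms it behaves roughly like this (the path `inputs/report.json` is an invented example):

```shell
f="inputs/report.json"

base=${f##*/}      # like {/}  : strip leading directories
stem=${base%.*}    # like {/.} : also strip the last extension
out="${stem}.out"  # the per-input result file pulled back by --return

echo "$f -> $out"
```

Knowing this expansion makes it easy to predict exactly which filenames will land back on the launch machine.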
If your job inputs are integers rather than files, a common pattern is:
- ship the worker once with `--basefile`
- keep inputs as replacement strings like `JOB_ID={}`
- return named outputs with `--return`
19.6 Environment handling on remote nodes
For remote jobs, prefer a short explicit env list:
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS
If you eventually need shell functions, aliases, or non-exported variables on remote nodes, GNU Parallel also supports env_parallel. For simple worker-style launchers, --env is usually enough and easier to reason about.
19.7 Heterogeneous cluster advice
If your machines are not identical:
- use an absolute Python path if needed
- put slot counts in `nodes.txt`
- give slower machines fewer slots
- keep thread caps at `1` unless you intentionally want hybrid parallelism
For example:
2/user@fast-node
1/user@slow-node
1/:
This is usually better than pretending all nodes should carry the same load.
19.8 Remote wrapper pattern
If the remote environment needs activation, wrap it explicitly in a shell that supports your activation logic:
parallel --slf nodes.txt \
--workdir . \
--env RUN_TAG \
'bash -lc "source ~/.bashrc && conda activate myenv && JOB_ID={} SLOT={%} python3 -u ./worker.py"' \
::: $(seq 1 100)
That is safer than assuming the default remote shell understands source or your environment manager.
19.9 A clean four-slot, two-node example
If you want a compact concrete template for 100 jobs over two remote machines plus local capacity:
# nodes.txt
2/:
1/user@node1
1/user@node2
mkdir -p logs
export RUN_TAG="cluster100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
parallel --slf nodes.txt \
--workdir . \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100)
This gives you:
- `100` total jobs
- `4` total concurrent slots across the node pool
- one centralized job log
- explicit env propagation
- the same worker contract as the single-machine version
20. Final recommended command
If you want one command that best matches the single-machine macOS use case, use this:
mkdir -p logs
export RUN_TAG="mac100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
nohup parallel --line-buffer --jobs 4 \
--joblog "logs/${RUN_TAG}_joblog.tsv" \
--delay 0.2 \
--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
::: $(seq 1 100) \
> "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &
echo $! > "logs/${RUN_TAG}.pid"
This gives you:
- `100` total jobs
- `4` concurrent Python workers
- shared batch context via `RUN_TAG`
- per-job context via `JOB_ID` and `SLOT`
- a macOS-safe default for CPU-heavy Python work
21. Bottom line
For clean, high-performance GNU Parallel workflows for Python:
- export shared run-level variables once
- cap math-library threads to `1`
- use `parallel --jobs N --joblog ... --delay ...`
- assign per-job env vars inline with `JOB_ID={}` and `SLOT={%}`
- keep the Python worker as stateless as practical
- write per-job outputs and keep a global job log
- use `sem --id ... --fg` only around narrow critical sections
- use a real lock/allocator when a worker must learn a specific resource identity
- for multi-node jobs, prefer `--slf` or `-S` over hand-written nested `ssh`
- when the filesystem is not shared, use `--basefile`, `--transferfile`, `--return`, and `--cleanup`