GNU Parallel: Running Many Jobs Across Multiple GPUs, CPU Cores, or Compute Nodes to Accelerate AI Workflows

Community Article Published March 7, 2026

GNU Parallel tutorial -- how to build an HPC cluster for AI out of your local machines.

This tutorial shows how to use GNU Parallel to run, say, 100 Python or R jobs with at most N (say, 4) running at the same time, while cleanly passing environment variables to each .py (or .R) worker script. It also shows how to manage jobs across a multi-GPU machine where each GPU can only handle a limited number of jobs, and how to send jobs to other machines over SSH under unified GNU Parallel management. Learning these principles will help you accelerate your AI workflows by more fully leveraging the compute available to you.

1. Recommended GNU Parallel Practices

Strong patterns when using GNU Parallel for batch jobs include:

  • nohup parallel ... & for long-running background jobs
  • --joblog for durable, machine-readable tracking of every job
  • --delay to stagger job starts and avoid thundering-herd problems
  • seq to generate clean numeric job sequences
  • Exported environment variables passed via --env when workers need shared context

These techniques make workflows easier to monitor, restart, and debug.

2. Thread Control for Python

GNU Parallel’s -j4 limits the number of worker processes. It does not stop math libraries (NumPy, OpenBLAS, MKL, Accelerate, NumExpr, etc.) from spawning extra threads inside each process.

Without explicit caps, four Python jobs can easily turn into 16–32 runnable threads, causing oversubscription, sluggish performance, and thermal throttling.

As a rule, set these limits before launching; on macOS this is often the single most important detail:

export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
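As a belt-and-braces measure, a worker can apply the same caps defensively in Python before importing any math library. A sketch; setdefault keeps whatever caps the launcher already exported:

```python
import os

# Cap math-library threads before the first NumPy/SciPy/etc. import;
# the libraries size their thread pools once, at import time.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)
for var in THREAD_VARS:
    os.environ.setdefault(var, "1")   # respect any cap the launcher already set

# import numpy as np   # safe to import math libraries from this point on
```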

3. Using Environment Variables for Clean Job Context

Keep each Python worker as stateless as practical. Pass configuration through environment variables:

  • Shared values (identical for every job): export once + --env
  • Per-job values (unique): assign inline right before calling the script

This reduces hidden shared state and makes job context explicit. Typical variables:

  • RUN_TAG — shared batch identifier
  • JOB_ID — unique per job
  • SLOT — GNU Parallel slot number (1 to N; with -j4, 1–4)

4. Install GNU Parallel

brew install parallel
parallel --citation          # one-time acknowledgement
parallel --number-of-cores

5. Mental Model: 100 Jobs, Maximum 4 Concurrent

seq 1 100 | parallel -j4 'python3 worker.py'
  • 100 total jobs
  • At most 4 running at once
  • As soon as one finishes, the next queued job starts automatically

-j4 is a concurrency limit, not hard CPU pinning. macOS decides the actual core placement.
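The queueing behavior can be sketched in plain Python as an analogy (this is not how GNU Parallel is implemented; ThreadPoolExecutor simply exhibits the same "at most N in flight" discipline):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

peak = 0        # highest number of jobs observed running at once
running = 0
counter_lock = threading.Lock()

def job(i):
    global peak, running
    with counter_lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.01)            # stand-in for real work
    with counter_lock:
        running -= 1
    return i

# 100 queued jobs, at most 4 workers: a finished job automatically
# frees a worker for the next queued job.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(job, range(1, 101)))

print(len(results), peak)   # 100 jobs total; peak never exceeds 4
```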

6. Essential Placeholders

Placeholder   Meaning                          Typical use
{}            the input item                   JOB_ID
{#}           job sequence number              overall counter
{%}           current slot number (1 to N)     SLOT

Quick demo:

seq 1 6 | parallel -j4 'echo input={} seq={#} slot={%}'

{%} is a GNU Parallel slot number, not a CPU core ID.

7. Minimal Stateless Python Worker

#!/usr/bin/env python3
import os
import time
from pathlib import Path

job_id = int(os.environ["JOB_ID"])
slot = int(os.environ["SLOT"])
run_tag = os.environ["RUN_TAG"]

Path("logs").mkdir(exist_ok=True)

print(f"start job_id={job_id} slot={slot} run_tag={run_tag}", flush=True)

# Replace with your actual work here
time.sleep(1)

with open(f"logs/{run_tag}_job_{job_id:03d}.txt", "w") as f:
    f.write(f"job_id={job_id}\nslot={slot}\nrun_tag={run_tag}\n")

print(f"done job_id={job_id} slot={slot}", flush=True)

Each worker gets its own JOB_ID and writes to its own output file, which avoids output collisions in the common case.

8. Reusable Launcher Script

Save a script like this (tailored to your task) as a .sh file and invoke it from the terminal (e.g., bash launch_script.sh). Use bash rather than sh: the script relies on set -o pipefail, which plain sh may not support.

#!/usr/bin/env bash
set -euo pipefail

mkdir -p logs

export RUN_TAG="run_2026"

# Thread caps
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

# Preview first (highly recommended)
parallel --dry-run --line-buffer --jobs 4 \
  --joblog "logs/${RUN_TAG}_joblog.tsv" \
  --delay 0.2 \
  --env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
  'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
  ::: $(seq 1 100)

# Real run in background
nohup parallel --line-buffer --jobs 4 \
  --joblog "logs/${RUN_TAG}_joblog.tsv" \
  --delay 0.2 \
  --env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
  'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
  ::: $(seq 1 100) \
  > "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &

echo $! > "logs/${RUN_TAG}.pid"

Two notes:

  • set -euo pipefail helps catch setup errors in the launcher itself, but it does not make the script fail fast on errors that happen later inside the detached nohup ... & job.
  • Writing the PID to logs/${RUN_TAG}.pid makes it easier to inspect or stop the batch later.
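As an illustration of what the PID file enables, here is a small hypothetical Python helper (the name batch_alive and the pid_file argument are assumptions, not part of GNU Parallel) that checks whether the recorded process still exists:

```python
import os

def batch_alive(pid_file):
    """Return True if the process whose PID is stored in pid_file exists.

    os.kill with signal 0 checks for existence without sending a signal.
    """
    with open(pid_file) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # process exists but belongs to another user
```

To stop the batch rather than just inspect it, send a real signal to that PID (see the GNU Parallel man page for how it reacts to termination signals) or stop the worker processes directly.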

9. Explaining Each Flag

  • set -euo pipefail — catches many launcher-script mistakes early
  • mkdir -p logs — guarantees the output directory exists
  • Thread exports — reduce oversubscription risk
  • --dry-run — shows exactly what will run before committing
  • --line-buffer — makes live output easier to read; complete lines from different jobs can still appear interleaved
  • --jobs 4 — concurrency limit
  • --joblog — TSV record of every job (start time, runtime, exit code, command)
  • --delay 0.2 — gentle stagger
  • --env ... — makes dependencies explicit and is especially useful for remote jobs
  • JOB_ID={} SLOT={%} — per-job context assignment
  • nohup ... & — lets the batch survive terminal close

10. Shared vs Per-Job Environment Variables

# Shared (export + optional --env for explicitness)
export RUN_TAG=batchA
parallel --env RUN_TAG 'python3 worker.py' ::: $(seq 1 10)

# Per-job (inline)
parallel 'JOB_ID={} python3 worker.py' ::: $(seq 1 10)

# Combined pattern 
export RUN_TAG=batchA
parallel --env RUN_TAG 'JOB_ID={} SLOT={%} python3 worker.py' ::: $(seq 1 10)

11. Good Logging Habits

  • Each job writes its own output file when practical
  • Keep one batch-wide .out and .err file
  • Always keep the --joblog TSV
  • Save the launcher PID if you detach the batch

12. Monitoring Commands

tail -f logs/${RUN_TAG}.out
tail -n +2 logs/${RUN_TAG}_joblog.tsv | wc -l     # jobs logged so far
column -t < logs/${RUN_TAG}_joblog.tsv | less -S
ps aux | grep '[p]ython3 -u ./worker.py'
cat logs/${RUN_TAG}.pid

If you need the final success/failure status of every job, treat --joblog as the source of truth.
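Because the joblog is a plain TSV with a header row (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command), it is easy to post-process. A sketch, using fabricated sample rows in place of a real logs/<RUN_TAG>_joblog.tsv:

```python
import csv
import io

# Header mirrors GNU Parallel's --joblog columns; the two data rows
# below are made-up samples standing in for a real joblog file.
joblog = (
    "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t1700000000.0\t1.01\t0\t42\t0\t0\tJOB_ID=1 python3 -u ./worker.py\n"
    "2\t:\t1700000000.2\t1.02\t0\t0\t1\t0\tJOB_ID=2 python3 -u ./worker.py\n"
)

failed = [
    row["Seq"]
    for row in csv.DictReader(io.StringIO(joblog), delimiter="\t")
    if row["Exitval"] != "0" or row["Signal"] != "0"
]
print(failed)   # sequence numbers of jobs that did not exit cleanly
```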

13. Validation Sequence

  1. Dry-run with placeholders
  2. Tiny 8-job test
  3. Verify logs and per-job files
  4. Scale to 100 (or more)

14. Common Pitfalls & Fixes

  • Forgetting thread caps: high CPU usage and sluggishness. Fix: export the five *_NUM_THREADS=1 variables.
  • All workers writing to the same file: interleaved or corrupted output. Fix: use unique filenames per JOB_ID.
  • Hidden shared state: race conditions and non-reproducible runs. Fix: pass configuration explicitly and isolate outputs.
  • Skipping --dry-run: quoting or placeholder bugs surface mid-run. Fix: preview first.
  • Expecting -j4 to pin CPU cores: confusion about scheduler behavior. Fix: treat -jN as a concurrency limit.

15. When --env Is Required

For purely local runs, exported variables are usually inherited by child processes already. Using --env is still a good habit because it:

  • Makes dependencies explicit
  • Improves readability and maintainability
  • Future-proofs the script for remote execution

16. Laptop-Specific Tip

Prevent macOS from sleeping during long foreground runs:

caffeinate -i bash your_launcher.sh

If your launcher immediately detaches the real work with nohup ... &, then caffeinate -i bash your_launcher.sh will end as soon as the launcher exits. In that case, wrap caffeinate around the long-lived parallel command itself, or run the batch in the foreground while caffeinate is active.

17. One-Command Pattern

mkdir -p logs
export RUN_TAG="run_2026"
export OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 VECLIB_MAXIMUM_THREADS=1 NUMEXPR_NUM_THREADS=1

nohup parallel --line-buffer --jobs 4 \
  --joblog "logs/${RUN_TAG}_joblog.tsv" \
  --delay 0.2 \
  --env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
  'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
  ::: $(seq 1 100) \
  > "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &

echo $! > "logs/${RUN_TAG}.pid"

18. Advanced: GPU locking with sem

GNU Parallel also has a semaphore tool:

sem

sem is an alias for:

parallel --semaphore

Use it when most of the pipeline is parallel, but one narrow step must be serialized or rate-limited.

Good uses include:

  • one SQLite writer at a time
  • one tar/zip append step at a time
  • only two calls at a time to a rate-limited external API
  • forcing narrow disk-I/O sections to happen one at a time

18.1 When sem is the right tool

sem is great when you need a mutex or a counting gate.

A different pattern is needed when the resource has an identity and the worker must know which one it got. In that case, use a real lock/allocator that records the assigned resource and cleans up reliably.

Examples where identity matters:

  • GPU 0 vs GPU 1
  • a specific port number
  • a specific scratch directory

sem limits how many jobs may enter a critical section, but it does not naturally tell the worker which resource ID it acquired.

So the rule is:

  • if you only need a mutex or a counting gate, use sem
  • if you need to allocate a specific resource identity, use a real lock/allocator

In other words, the protected resource is anonymous: workers only need to know "it's my turn," not "which one did I get."

For example, suppose 50 jobs each produce a small CSV and you want to append them all into one combined file:

for f in results/*.csv; do
  sem -j1 --id append_lock "cat '$f' >> combined.csv"
done
sem --wait --id append_lock

No job needs to know anything about the lock itself — it just needs the guarantee that only one cat >> combined.csv runs at a time so lines don't interleave. The semaphore is a faceless gate: enter, do the write, leave.

Similarly, with a counting gate (-j3 instead of -j1), three workers can hit an API simultaneously but the fourth blocks until a slot opens. No worker needs to know "I'm in slot 2" — it just needs permission to go.

Contrast this with something like GPU assignment: if you have two GPUs and each job must set CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1, a bare semaphore can't help — the worker needs to know which GPU it got. That requires an allocator (e.g., a FIFO queue of GPU IDs) rather than a simple count-based gate.

18.2 Important sem behavior

Three details matter:

  1. sem defaults to background mode
  2. in scripts, you should set an explicit semaphore name with --id
  3. if you intentionally queue background semaphore jobs, finish with sem --wait --id NAME

The explicit --id matters because the default semaphore name is tied to the controlling TTY, which is convenient interactively but fragile in scripts.

When the current shell must block until the protected command finishes, add --fg.

18.3 Simple mutex example

Suppose the main computation can run in parallel, but the final append step must happen one job at a time:

parallel --jobs 4 --env RUN_TAG \
  'JOB_ID={} SLOT={%} python3 -u ./compute_stage.py &&
   sem --fg --id "${RUN_TAG}_append" -j1 \
     "JOB_ID={} python3 -u ./append_one_result.py"' \
  ::: $(seq 1 100)

What is happening here:

  • GNU Parallel runs four compute jobs concurrently
  • each job reaches the append step when ready
  • sem --fg --id "${RUN_TAG}_append" -j1 turns the append step into a mutex
  • --fg is important because the outer job should wait until the protected step finishes

If you omit --fg, sem will normally queue the protected command in the background and the outer GNU Parallel job may finish too early.

18.4 Counting semaphore example

Suppose only two workers at a time may call an expensive remote service:

parallel --jobs 4 \
  'JOB_ID={} python3 -u ./prepare_request.py &&
   sem --fg --id paid_api -j2 \
     "JOB_ID={} python3 -u ./call_paid_api.py"' \
  ::: $(seq 1 100)

This means:

  • four jobs can do the cheap local preparation concurrently
  • only two jobs can be inside call_paid_api.py at once
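The counting-gate idea is the same one behind threading.Semaphore in Python. This analogy (not GNU Parallel code) runs eight "callers," of which at most two are ever inside the gated section:

```python
import threading
import time

api_gate = threading.Semaphore(2)   # counting gate: two slots
in_flight = 0
peak = 0
counter_lock = threading.Lock()

def call_api(i):
    global in_flight, peak
    with api_gate:                  # a third concurrent caller blocks here
        with counter_lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)            # stand-in for the expensive remote call
        with counter_lock:
            in_flight -= 1

threads = [threading.Thread(target=call_api, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)   # never exceeds 2
```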

18.5 Background semaphore pattern

Sometimes you do want the semaphore-wrapped commands to queue in the background. In that case, explicitly wait for them:

for f in logs/*.json; do
  sem --id merge_json -j1 python3 -u ./merge_one_file.py "$f"
done

sem --wait --id merge_json

That final sem --wait is what turns “jobs were queued” into “the merge is definitely complete.”

18.6 Locking and disk I/O

The GNU Parallel manual specifically calls out semaphores as a way to reduce disk contention. If your workload is “read everything, compute, write everything,” a narrow serialized I/O section can outperform letting all jobs hammer the same disk at once.

This is especially relevant if your Python workers all read from or write to a single slow external disk.

18.7 Practical guidance

Use sem for:

  • a single shared output database
  • a single aggregate output file
  • a scarce external API
  • a small shared scratch step

Do not use sem alone if you still need:

  • a specific GPU number
  • a specific port number
  • a specific scratch directory
  • cleanup tied to that specific resource identity

For that use case, a real lock/allocator is stronger.

18.8 GPU lock allocators for 1, 2, or K jobs per GPU

When workers must learn a specific GPU identity, and you want to allow a fixed number of concurrent jobs per GPU, use a real lock allocator rather than a bare semaphore.

The core idea is simple:

  • each GPU has one or more lockable slots
  • each running job must atomically claim one slot
  • the claimed slot tells the worker which GPU it got
  • the slot is released when the worker exits

That yields a simple, tunable policy:

  • JOBS_PER_GPU=1: one job per GPU
  • JOBS_PER_GPU=2: two jobs per GPU
  • JOBS_PER_GPU=K: K jobs per GPU

This is stronger than a plain counting gate because the worker learns a named resource identity such as GPU_ID=0.

18.9 Why a real allocator is better than fixed slot arithmetic

A fixed mapping like “slot 1 and 2 go to GPU 0, slot 3 and 4 go to GPU 1” can work in simple cases, but a real allocator is usually better when:

  • job runtimes vary
  • GPU IDs are non-contiguous
  • you want to reserve some GPUs and exclude others
  • workers must clean up explicitly on failure
  • you want inspectable lock state on disk

Using an explicit list like GPU_IDS="0 2 5 7" also keeps the scheduling policy obvious.

18.10 Mental model: one token = one allowed seat

Suppose:

export GPU_IDS="0 1 2 3"
export JOBS_PER_GPU=2

Then the allowed seats are conceptually:

  • GPU 0 → slot 1, slot 2
  • GPU 1 → slot 1, slot 2
  • GPU 2 → slot 1, slot 2
  • GPU 3 → slot 1, slot 2

So the total concurrent GPU-backed workers is:

4 GPUs × 2 jobs per GPU = 8 total GPU slots

A job may run only after it claims one of those eight slots.
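The seat arithmetic can be made concrete in a few lines of Python (illustration only; the variable names mirror the exports above):

```python
# Enumerate the allowed seats for the pool above.
gpu_ids = [0, 1, 2, 3]        # from GPU_IDS="0 1 2 3"
jobs_per_gpu = 2              # from JOBS_PER_GPU=2

seats = [(gpu, slot)
         for gpu in gpu_ids
         for slot in range(1, jobs_per_gpu + 1)]
print(len(seats))   # 4 GPUs x 2 jobs per GPU = 8 seats
```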

18.11 Launcher pattern

#!/usr/bin/env bash
set -euo pipefail

mkdir -p logs

export RUN_TAG="gpu_batch"
export GPU_IDS="0 1 2 3"          # explicit GPU pool
export JOBS_PER_GPU=2             # 1, 2, or any other integer K
export WORKER="./worker.sh"       # generic executable entry point
export LOCK_DIR="${TMPDIR:-/tmp}/${RUN_TAG}_gpu_locks"

mkdir -p "$LOCK_DIR"

# Only clear old locks if you are certain no earlier batch is still using them.
rm -rf "$LOCK_DIR"/*

set -- $GPU_IDS
GPU_COUNT=$#
TOTAL_GPU_SLOTS=$(( GPU_COUNT * JOBS_PER_GPU ))

# Preview first
parallel --dry-run --jobs "${TOTAL_GPU_SLOTS}" \
  --env RUN_TAG,GPU_IDS,JOBS_PER_GPU,WORKER,LOCK_DIR \
  'JOB_ID={} /bin/bash ./gpu_lock_wrapper.sh' \
  ::: $(seq 1 100)

# Real run
nohup parallel --line-buffer --jobs "${TOTAL_GPU_SLOTS}" \
  --joblog "logs/${RUN_TAG}_joblog.tsv" \
  --delay 0.2 \
  --env RUN_TAG,GPU_IDS,JOBS_PER_GPU,WORKER,LOCK_DIR \
  'JOB_ID={} /bin/bash ./gpu_lock_wrapper.sh' \
  ::: $(seq 1 100) \
  > "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &

echo $! > "logs/${RUN_TAG}.pid"

A good default is to set:

--jobs = number_of_GPUs × JOBS_PER_GPU

That way, the outer GNU Parallel concurrency matches the number of GPU seats you actually intend to allow.

If --jobs is larger than the number of available GPU slots, the pattern is still correct, but extra wrappers will simply wait in the lock-acquisition loop.

18.12 GPU lock wrapper

#!/usr/bin/env bash
# File: gpu_lock_wrapper.sh
set -euo pipefail

job_id="${JOB_ID:?JOB_ID is required}"
worker="${WORKER:?WORKER is required}"

GPU_ID=""
GPU_SLOT=""
LOCK_PATH=""
TAG_PATH=""

cleanup() {
  if [ -n "${LOCK_PATH}" ]; then
    rm -rf "${LOCK_PATH}" 2>/dev/null || true
  fi
  if [ -n "${TAG_PATH}" ]; then
    rm -f "${TAG_PATH}" 2>/dev/null || true
  fi
}
trap cleanup EXIT INT TERM HUP

while [ -z "${GPU_ID}" ]; do
  for gpu in $GPU_IDS; do
    for slot in $(seq 1 "${JOBS_PER_GPU}"); do
      candidate="${LOCK_DIR}/gpu${gpu}_slot${slot}.lock"

      # mkdir is used because directory creation is atomic.
      if mkdir "${candidate}" 2>/dev/null; then
        GPU_ID="${gpu}"
        GPU_SLOT="${slot}"
        LOCK_PATH="${candidate}"
        TAG_PATH="${LOCK_DIR}/job_${job_id}_gpu${GPU_ID}_slot${GPU_SLOT}.tag"

        printf 'gpu=%s\nslot=%s\n' "${GPU_ID}" "${GPU_SLOT}" > "${TAG_PATH}"
        break 2
      fi
    done
  done

  if [ -z "${GPU_ID}" ]; then
    sleep 1
  fi
done

echo "start job_id=${job_id} gpu=${GPU_ID} slot=${GPU_SLOT}" >&2

# The worker should read JOB_ID, GPU_ID, and GPU_SLOT from the environment.
# CUDA_VISIBLE_DEVICES is shown here because it is a common GPU selector.
CUDA_VISIBLE_DEVICES="${GPU_ID}" \
GPU_ID="${GPU_ID}" \
GPU_SLOT="${GPU_SLOT}" \
JOB_ID="${job_id}" \
"${worker}"

echo "done job_id=${job_id} gpu=${GPU_ID} slot=${GPU_SLOT}" >&2

This pattern is task agnostic:

  • the worker can be Python, R, Bash, or anything else executable
  • the GPU pool can be contiguous or non-contiguous
  • changing JOBS_PER_GPU changes the policy from 1 to 2 to K jobs per GPU without rewriting the allocator

If your runtime uses a different GPU-selection variable, export that instead of CUDA_VISIBLE_DEVICES.

18.13 Why mkdir is used for the lock

A plain semaphore limits how many jobs may proceed, but not which GPU each job received.

The lock-directory approach solves both problems at once:

  • gpu0_slot1.lock means one seat on GPU 0 is occupied
  • gpu0_slot2.lock means a second seat on GPU 0 is occupied
  • if both exist, GPU 0 is full when JOBS_PER_GPU=2

The important property is that creating the directory is atomic, so only one contender can successfully claim the same slot.
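The same atomic-mkdir trick works in any language. In this Python sketch, eight threads race to claim one seat and exactly one os.mkdir call succeeds:

```python
import os
import tempfile
import threading

lock_path = os.path.join(tempfile.mkdtemp(), "gpu0_slot1.lock")
winners = []   # list.append is atomic under CPython's GIL

def claim(job_name):
    try:
        os.mkdir(lock_path)        # atomic: at most one mkdir succeeds
        winners.append(job_name)
    except FileExistsError:
        pass                       # seat already taken

threads = [threading.Thread(target=claim, args=(f"job{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(winners))   # exactly 1
```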

18.14 Monitoring the lock state

find "${LOCK_DIR}" -maxdepth 1 -type d -name 'gpu*_slot*.lock' | sort
grep -H . "${LOCK_DIR}"/job_*.tag 2>/dev/null || true
tail -f "logs/${RUN_TAG}.out"
column -t < "logs/${RUN_TAG}_joblog.tsv" | less -S

These show:

  • which GPU slots are currently occupied
  • which jobs currently claim which GPUs
  • the batch-wide stdout/stderr stream
  • the GNU Parallel job log

18.15 Failure handling and stale locks

The trap handles normal completion and common termination signals, so locks are released when a worker exits in ordinary ways.

Two cases still require manual cleanup:

  • kill -9
  • machine crash or sudden power loss

In those cases, the shell cannot run cleanup code, so stale lock directories may remain behind.

A safe cleanup sequence is:

  1. verify the old batch is truly gone
  2. inspect the lock directory
  3. remove stale lock directories only after you are sure nothing still owns them

Using a dedicated LOCK_DIR per batch, or at least per host, makes this easier to reason about.
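As one possible aid for step 2, a hypothetical helper (not part of the wrapper above) can flag lock directories that look abandoned by age; treat its output as candidates to inspect, never as automatically safe to delete:

```python
import os
import time

def stale_locks(lock_dir, max_age_s=24 * 3600):
    """List lock directories whose mtime is older than max_age_s.

    Age alone is only a heuristic: confirm nothing still owns a lock
    before removing it.
    """
    now = time.time()
    stale = []
    for name in os.listdir(lock_dir):
        path = os.path.join(lock_dir, name)
        if name.endswith(".lock") and os.path.isdir(path):
            if now - os.path.getmtime(path) > max_age_s:
                stale.append(path)
    return sorted(stale)
```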

18.16 Practical guidance for GPU locks

Use this pattern when:

  • jobs must learn a specific GPU ID
  • you want 1, 2, or K jobs per GPU
  • job durations are uneven
  • you want explicit, inspectable scheduler state

A few rules keep it reliable:

  • keep GPU_IDS explicit
  • keep --jobs aligned with the total number of GPU slots
  • treat JOBS_PER_GPU as a policy limit, not a performance guarantee
  • increase JOBS_PER_GPU only when memory use and contention remain acceptable
  • keep the lock directory on a filesystem where mkdir is atomic and visible to all contenders in the same pool
  • on multi-node runs, use a separate lock namespace per host unless the GPUs are intentionally managed as one shared pool

In short:

  • use sem for anonymous gates
  • use a GPU lock allocator when workers must acquire a named GPU and you want a clear 1, 2, or K jobs-per-GPU policy

19. Advanced: Multi-node jobs over SSH

GNU Parallel can distribute jobs across machines directly with --sshlogin / -S or --sshloginfile / --slf, instead of making you hand-write nested ssh ... loops.

That is usually cleaner, easier to log, and easier to scale.

19.1 Preconditions

Before doing multi-node jobs, make sure:

  • you can SSH from the launch machine to each worker without interactive prompts
  • GNU Parallel is installed on the remote machines
  • python3 exists on each remote machine, or you use an absolute interpreter path
  • you know whether the project directory is shared across machines

19.2 First sanity checks

Start with these:

parallel -S user@node1,user@node2 --nonall hostname
parallel -S user@node1,user@node2 --nonall 'which python3'
parallel -S user@node1,user@node2 --nonall 'parallel --version | head -n1'

If these do not work cleanly, do not launch the real batch yet.

19.3 Shared-filesystem setup

The simplest multi-node case is when every machine sees the same project path.

Example node file:

2/:
1/user@node1
1/user@node2

Interpretation:

  • : means the local machine
  • 2/: gives the local machine two slots
  • 1/user@node1 gives node1 one slot
  • 1/user@node2 gives node2 one slot

So the total concurrency is 4.
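The slot arithmetic can be sketched in Python (a simplification: real GNU Parallel defaults a host line without an explicit N/ prefix to that host's core count):

```python
def total_slots(node_lines):
    """Sum slot counts from GNU Parallel --slf lines such as '2/:'.

    Simplification: a line without an explicit 'N/' prefix counts as
    one slot here; real GNU Parallel uses that host's core count.
    """
    total = 0
    for line in node_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        head, sep, _rest = line.partition("/")
        total += int(head) if sep and head.isdigit() else 1
    return total

print(total_slots(["2/:", "1/user@node1", "1/user@node2"]))   # 4
```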

Now launch:

export RUN_TAG="cluster100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

parallel --slf nodes.txt \
  --workdir . \
  --joblog "logs/${RUN_TAG}_joblog.tsv" \
  --delay 0.2 \
  --env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
  'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
  ::: $(seq 1 100)

Why this is the clean pattern:

  • GNU Parallel distributes jobs itself
  • the job log stays centralized on the launch machine
  • --workdir . tells GNU Parallel to use the caller’s current working directory on the remote hosts, with special handling when the directory is under your home directory
  • --env explicitly copies the exported variables, which is especially useful for remote execution

19.4 Why --sshloginfile is better than hand-written ssh loops

Using --slf nodes.txt or -S host1,host2 is often better because:

  • GNU Parallel knows which host ran each job
  • slot counts are managed in one place
  • file transfer features integrate cleanly
  • the command itself stays focused on the worker

The more nodes you add, the more valuable that becomes.

19.5 If the filesystem is not shared

If remote machines do not see the same project directory, use GNU Parallel’s transfer features.

The most important ones are:

  • --basefile file: copy a common file to each remote host before the first job
  • --transferfile file: copy a per-job input file
  • --return file: pull a result file back to the launch machine
  • --cleanup: remove transferred files after the job
  • --trc file: shorthand for transfer, return, cleanup in file-based workflows

Example pattern for file-based jobs:

find inputs -name '*.json' | parallel --slf nodes.txt \
  --workdir ... \
  --basefile worker.py \
  --transferfile {} \
  --return '{/.}.out' \
  --cleanup \
  'python3 -u ./worker.py {} > {/.}.out'

Interpretation:

  • --workdir ... is literal GNU Parallel syntax: the three dots are a special value asking for an auto-created temporary working directory on each remote host
  • worker.py is copied to each node before the first job
  • each input file is copied to the node that will process it
  • one output file per input is copied back
  • temporary transferred files are cleaned up

If your job inputs are integers rather than files, a common pattern is:

  • ship the worker once with --basefile
  • keep inputs as replacement strings like JOB_ID={}
  • return named outputs with --return

19.6 Environment handling on remote nodes

For remote jobs, prefer a short explicit env list:

--env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS

If you eventually need shell functions, aliases, or non-exported variables on remote nodes, GNU Parallel also supports env_parallel. For simple worker-style launchers, --env is usually enough and easier to reason about.

19.7 Heterogeneous cluster advice

If your machines are not identical:

  • use an absolute Python path if needed
  • put slot counts in nodes.txt
  • give slower machines fewer slots
  • keep thread caps at 1 unless you intentionally want hybrid parallelism

For example:

2/user@fast-node
1/user@slow-node
1/:

This is usually better than pretending all nodes should carry the same load.

19.8 Remote wrapper pattern

If the remote environment needs activation, wrap it explicitly in a shell that supports your activation logic:

parallel --slf nodes.txt \
  --workdir . \
  --env RUN_TAG \
  'bash -lc "source ~/.bashrc && conda activate myenv && JOB_ID={} SLOT={%} python3 -u ./worker.py"' \
  ::: $(seq 1 100)

That is safer than assuming the default remote shell understands source or your environment manager.

19.9 A clean four-slot, two-node example

If you want a compact concrete template for 100 jobs over two remote machines plus local capacity:

# nodes.txt
2/:
1/user@node1
1/user@node2
mkdir -p logs

export RUN_TAG="cluster100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

parallel --slf nodes.txt \
  --workdir . \
  --joblog "logs/${RUN_TAG}_joblog.tsv" \
  --delay 0.2 \
  --env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
  'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
  ::: $(seq 1 100)

This gives you:

  • 100 total jobs
  • 4 total concurrent slots across the node pool
  • one centralized job log
  • explicit env propagation
  • the same worker contract as the single-machine version

20. Final recommended command

If you want one command that best matches the single-machine macOS use case, use this:

mkdir -p logs

export RUN_TAG="mac100"
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

nohup parallel --line-buffer --jobs 4 \
  --joblog "logs/${RUN_TAG}_joblog.tsv" \
  --delay 0.2 \
  --env RUN_TAG,OMP_NUM_THREADS,OPENBLAS_NUM_THREADS,MKL_NUM_THREADS,VECLIB_MAXIMUM_THREADS,NUMEXPR_NUM_THREADS \
  'JOB_ID={} SLOT={%} python3 -u ./worker.py' \
  ::: $(seq 1 100) \
  > "logs/${RUN_TAG}.out" 2> "logs/${RUN_TAG}.err" &

echo $! > "logs/${RUN_TAG}.pid"

This gives you:

  • 100 total jobs
  • 4 concurrent Python workers
  • shared batch context via RUN_TAG
  • per-job context via JOB_ID and SLOT
  • reproducible logs
  • a macOS-safe default for CPU-heavy Python work

21. Bottom line

For clean, high-performance GNU Parallel workflows for Python:

  1. export shared run-level variables once
  2. cap math-library threads to 1
  3. use parallel --jobs N --joblog ... --delay ...
  4. assign per-job env vars inline with JOB_ID={} and SLOT={%}
  5. keep the Python worker as stateless as practical
  6. write per-job outputs and keep a global job log
  7. use sem --id ... --fg only around narrow critical sections
  8. use a real lock/allocator when a worker must learn a specific resource identity
  9. for multi-node jobs, prefer --slf or -S over hand-written nested ssh
  10. when the filesystem is not shared, use --basefile, --transferfile, --return, and --cleanup
