
Training Guide: RhythmEnv GRPO

What we're training

A Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.

The goal is not to teach the model the rules of the environment; a capable LLM already understands them from the prompt. The goal is to calibrate a small model to do online behavioral inference: read who you're helping from how the environment responds, not from what it tells you.


Stack

| Component | Choice |
| --- | --- |
| Model | unsloth/Qwen2.5-3B-Instruct |
| Quantization | 4-bit NF4 via Unsloth |
| LoRA rank | 4 |
| Training algorithm | GRPO (TRL 0.22.2) |
| Hardware | Free Colab T4 (~3 hours for 500 steps) |

Three-layer reward stack

Each training step scores four candidate completions per prompt across three reward functions:

| Layer | Function | Signal | Pass | Fail |
| --- | --- | --- | --- | --- |
| 1 | format_valid | Is the output a parseable action name? | +1.0 | -2.0 |
| 2 | action_legal | Is it one of the 10 valid ActionType values? | +0.5 | -1.0 |
| 3 | env_reward | Real reward from stepping the environment | varies | -3.0 |
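The first two layers are cheap string checks. A minimal sketch, assuming an illustrative action list (the real environment defines its own 10 ActionType values, which may differ from these names):

```python
# Sketch of the first two reward layers. The action names are placeholders;
# the real RhythmEnv defines its own ActionType enum.
VALID_ACTIONS = {
    "sleep", "meditate", "socialise", "work", "exercise",
    "eat", "read", "play", "walk", "rest",
}

def format_valid(completion: str) -> float:
    """Layer 1: is the output a single parseable action name?"""
    token = completion.strip().lower()
    # One bare word, letters only -- verbose or malformed output fails.
    return 1.0 if token.isalpha() else -2.0

def action_legal(completion: str) -> float:
    """Layer 2: is it one of the valid ActionType values?"""
    return 0.5 if completion.strip().lower() in VALID_ACTIONS else -1.0
```

Because the layers are scored independently, a completion like "I think I should sleep" is penalized by layer 1 even though it mentions a legal action.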

`env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action, so the reward cannot be fabricated.
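The replay mechanism can be sketched with a stub environment. `StubRhythmEnv` is a hypothetical stand-in (the real RhythmEnv has 5 meters and hidden profiles), but it demonstrates the determinism-per-seed property the reward relies on:

```python
import random

class StubRhythmEnv:
    """Hypothetical stand-in for RhythmEnv: deterministic given a seed.
    Only the replay mechanism is demonstrated here."""
    def __init__(self, seed: int):
        self.rng = random.Random(seed)

    def step(self, action: str) -> float:
        # Deterministic per (seed, action sequence) -- the same property
        # the replay-based reward depends on.
        return self.rng.random() + (0.5 if action == "sleep" else 0.0)

def env_reward(seed: int, action_history: list, candidate: str) -> float:
    """Rebuild the exact episode state, then step with the candidate."""
    env = StubRhythmEnv(seed)
    for past_action in action_history:   # replay the recorded prefix
        env.step(past_action)
    return env.step(candidate)           # real reward for the candidate
```

Replaying the prefix means two candidates for the same prompt are always scored from an identical environment state, so reward differences reflect the action choice alone.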


Key config choices

```python
GRPOConfig(
    beta=0.01,                # KL penalty -- default 0.04 caused explosion to kl=10731 at step 205
    max_completion_length=16, # action names are <=15 chars; prevents verbose drift
    learning_rate=2e-4,
    num_generations=4,        # 4 candidates per prompt -- enough variance for GRPO signal
    max_steps=500,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
```

`beta=0.01` is the critical fix from the first training run. The default value (0.04) caused the policy to drift so far from the reference model that completion length jumped from 4 tokens to 368, saturating the maximum. `max_completion_length=16` provides a hard cap as a second safeguard.
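For context, a hedged sketch of how this config might be wired into TRL's GRPOTrainer. The variable names (`model`, `train_dataset`) and the single example reward function are assumptions standing in for the notebook's actual cells, which define all three reward layers:

```python
# Sketch only: assumes `model` and `train_dataset` were created in
# earlier notebook cells (Unsloth 4-bit load + dataset generation).
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # TRL reward functions score a batch of completions and
    # return one float per completion.
    return [1.0 if c.strip().isalpha() else -2.0 for c in completions]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward],  # the notebook also adds the legality and env-replay layers
    args=GRPOConfig(
        beta=0.01,
        max_completion_length=16,
        num_generations=4,
        learning_rate=2e-4,
        max_steps=500,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```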


Dataset

Generated from 200 simulated episodes using a mixed strategy (heuristic + random actions) across all three profiles. Each sample is one step:

```python
{
    "prompt": [system_msg, user_observation],
    "seed": int,              # episode seed -> deterministic profile + events
    "step_index": int,        # which step in the episode
    "action_history": list,   # actions taken before this step
}
```

The dataset exposes the model to all three profiles and a range of meter states. The mixed strategy (rather than a pure heuristic) ensures the model also sees suboptimal states and can learn to recover from them.
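The generation loop can be sketched as follows, under assumptions: `heuristic` and the action list are placeholders, and the `prompt` field (built from the rendered observation) is omitted for brevity:

```python
import random

def generate_samples(num_episodes, episode_len, heuristic, actions, mix=0.5):
    """Roll mixed-strategy episodes and record the replay fields
    (seed, step_index, action_history) for every step."""
    samples = []
    for seed in range(num_episodes):
        rng = random.Random(seed)   # seed fixes profile + events
        history = []
        for step_index in range(episode_len):
            samples.append({
                "seed": seed,
                "step_index": step_index,
                "action_history": list(history),  # state *before* this step
            })
            # Mixed strategy: heuristic with probability `mix`, else random.
            if rng.random() < mix:
                history.append(heuristic(history))
            else:
                history.append(rng.choice(actions))
    return samples
```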


Baselines

Established before training on 5 episodes × 3 profiles:

| Strategy | Introvert | Extrovert | Workaholic |
| --- | --- | --- | --- |
| Random | ~0.65 | ~0.70 | ~0.65 |
| Heuristic | ~0.78 | ~0.76 | ~0.82 |

The heuristic baseline uses observable rules only (sleep when vitality is low, meditate when serenity is low, socialise when connection drops). It cannot differentiate profiles.
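Those rules amount to a small threshold policy. A sketch, with assumed cutoffs and an assumed default action (the actual heuristic's thresholds may differ):

```python
# Observable-rules heuristic. Thresholds (0.3) and the default action
# are assumptions; meters are taken to be in [0.0, 1.0].
def heuristic_action(meters: dict) -> str:
    if meters["vitality"] < 0.3:
        return "sleep"        # sleep when vitality is low
    if meters["serenity"] < 0.3:
        return "meditate"     # meditate when serenity is low
    if meters["connection"] < 0.3:
        return "socialise"    # socialise when connection drops
    return "work"             # assumed filler when all meters are healthy
```

Because the policy reads only the visible meters, it produces the same behaviour for every profile, which is exactly why it cannot differentiate them.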

A trained agent should beat the heuristic on at least 2 of 3 profiles, with qualitatively different action sequences per profile: the introvert's week should look nothing like the workaholic's.


How to run

Open `training/RhythmEnv_GRPO_Training.ipynb` in Colab with a T4 GPU runtime.

Run cells in order:

  1. Install dependencies
  2. Clone repo from HF Space
  3. Verify environment
  4. Run baseline evaluation (saves baseline_results)
  5. Generate dataset
  6. Load model (Qwen 2.5-3B + LoRA)
  7. Setup reward functions
  8. Configure training (beta=0.01, max_completion_length=16)
  9. Train (trainer.train())
  10. Save model
  11. Generate training plots
  12. Evaluate trained model
  13. Generate comparison chart (baseline_vs_trained.png)

Expected training behaviour

Healthy run: completion_length stays at 3–16 tokens throughout, KL stays below 1.0, mean reward climbs from ~1.5 toward ~3.0 over 500 steps.

Warning signs: completion_length spiking above 50, clipped_ratio approaching 1.0, KL above 5.0. If any of these appear, the beta=0.01 fix is not being applied.
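These checks are easy to automate against the trainer's logged metrics. The metric names below mirror the terms used above but are assumptions about the exact log keys:

```python
def training_health(metrics: dict) -> list:
    """Return warning strings for the failure signatures described above.
    Keys (completion_length, clipped_ratio, kl) are assumed log names."""
    warnings = []
    if metrics.get("completion_length", 0) > 50:
        warnings.append("completion_length spiking: beta fix likely not applied")
    if metrics.get("clipped_ratio", 0) > 0.9:
        warnings.append("clipped_ratio near 1.0: completions hitting the cap")
    if metrics.get("kl", 0) > 5.0:
        warnings.append("KL above 5.0: policy drifting from reference")
    return warnings
```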


Output artifacts

After a successful run, download these and commit to the repo:

- `plots/training_loss.png` - loss curve across 500 steps
- `plots/reward_curve.png` - mean reward with ±1 std band
- `plots/baseline_vs_trained.png` - comparison bar chart (random / heuristic / trained)
- `plots/eval_results.json` - raw per-episode scores