Training Guide: RhythmEnv GRPO
What we're training
A Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.
The goal is not to teach the model the rules of the environment; a capable LLM already understands them from the prompt. The goal is to calibrate a small model to do online behavioral inference: read who you're helping from how the environment responds, not from what it tells you.
Stack
| Component | Choice |
|---|---|
| Model | unsloth/Qwen2.5-3B-Instruct |
| Quantization | 4-bit NF4 via Unsloth |
| LoRA rank | 4 |
| Training algorithm | GRPO (TRL 0.22.2) |
| Hardware | Free Colab T4 (~3 hours for 500 steps) |
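For reference, loading this stack with Unsloth looks roughly like the sketch below; `max_seq_length`, `lora_alpha`, and the target modules are assumptions rather than the notebook's pinned values.

```python
# Sketch: 4-bit Qwen 2.5-3B with a rank-4 LoRA adapter via Unsloth.
# max_seq_length, lora_alpha, and target_modules are assumed values.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",
    max_seq_length=1024,   # assumed; must fit the observation prompt
    load_in_4bit=True,     # NF4 quantization
)

model = FastLanguageModel.get_peft_model(
    model,
    r=4,                   # LoRA rank from the table above
    lora_alpha=8,          # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed subset
)
```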
Three-layer reward stack
Each training step scores four candidate completions per prompt across three reward functions:
| Layer | Function | Signal | Pass | Fail |
|---|---|---|---|---|
| 1 | `format_valid` | Is the output a parseable action name? | +1.0 | -2.0 |
| 2 | `action_legal` | Is it one of the 10 valid ActionType values? | +0.5 | -1.0 |
| 3 | `env_reward` | Real reward from stepping the environment | varies | -3.0 |
`env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action, so the reward cannot be fabricated.
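As a concrete sketch, the three layers map onto TRL-style reward functions like the ones below. `RhythmEnv`, `VALID_ACTIONS`, and the environment step API are stand-ins for this repo's actual code, not its real interface; only the reward values mirror the table above.

```python
def _text(completion):
    # With chat-format prompts, TRL passes completions as message lists.
    return completion[0]["content"] if isinstance(completion, list) else completion

def format_valid(completions, **kwargs):
    """Layer 1: is the output a single, parseable action name?"""
    return [1.0 if _text(c).strip() and " " not in _text(c).strip() else -2.0
            for c in completions]

def action_legal(completions, **kwargs):
    """Layer 2: is it one of the 10 valid ActionType values?"""
    return [0.5 if _text(c).strip().lower() in VALID_ACTIONS else -1.0
            for c in completions]

def env_reward(completions, seed, action_history, **kwargs):
    """Layer 3: replay the episode from its seed, then step it with the candidate."""
    # step_index also arrives via kwargs but is implied by len(action_history).
    rewards = []
    for c, s, history in zip(completions, seed, action_history):
        action = _text(c).strip().lower()
        if action not in VALID_ACTIONS:
            rewards.append(-3.0)          # cannot step the env with an illegal action
            continue
        env = RhythmEnv(seed=s)           # seed fixes the hidden profile and events
        env.reset()
        for past in history:              # reconstruct the exact episode state
            env.step(past)
        _, reward, *_ = env.step(action)  # real environment reward
        rewards.append(float(reward))
    return rewards
```

TRL's GRPOTrainer forwards extra dataset columns to reward functions as keyword arguments, one list per batch, which is what lets `env_reward` see the seed and history for each prompt.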
Key config choices
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    beta=0.01,                    # KL penalty; default 0.04 caused explosion to kl=10731 at step 205
    max_completion_length=16,     # action names are ≤15 chars; prevents verbose drift
    learning_rate=2e-4,
    num_generations=4,            # 4 candidates per prompt; enough variance for GRPO signal
    max_steps=500,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)
```
`beta=0.01` is the critical fix from the first training run. The default value caused the policy to drift so far from the reference model that completion length jumped from 4 tokens to 368 tokens, saturating the max. `max_completion_length=16` provides a hard cap as a second safeguard.
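Hooked up to the trainer, this config plus the three reward functions becomes a short call; a sketch assuming the function names from the reward-stack section and a `train_dataset` built from the samples described in the next section.

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,                     # the 4-bit + LoRA model from Unsloth
    args=training_args,              # the GRPOConfig above
    train_dataset=train_dataset,     # prompts plus seed / step_index / action_history columns
    reward_funcs=[format_valid, action_legal, env_reward],  # scores are summed per completion
    processing_class=tokenizer,
)
trainer.train()
```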
Dataset
Generated from 200 simulated episodes using a mixed strategy (heuristic + random actions) across all three profiles. Each sample is one step:
```python
{
    "prompt": [system_msg, user_observation],
    "seed": int,              # episode seed: deterministic profile + events
    "step_index": int,        # which step in the episode
    "action_history": list,   # actions taken before this step
}
```
The dataset gives the model exposure to all three profiles and a range of meter states. The mixed strategy (rather than pure heuristic) ensures the model also sees suboptimal states, so it can learn to recover from them.
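The generation loop is roughly the following; `RhythmEnv`, `make_observation_prompt`, `heuristic_action`, `random_action`, `SYSTEM_MSG`, and the 70/30 mix are illustrative assumptions, not the repo's exact script.

```python
import random

def generate_dataset(num_episodes=200, heuristic_prob=0.7):
    """Roll out mixed-strategy episodes and emit one training sample per step."""
    samples = []
    for _ in range(num_episodes):
        seed = random.randrange(2**31)
        env = RhythmEnv(seed=seed)      # seed fixes the hidden profile and events
        obs = env.reset()
        history = []
        step_index = 0
        done = False
        while not done:
            samples.append({
                "prompt": [SYSTEM_MSG, make_observation_prompt(obs)],
                "seed": seed,
                "step_index": step_index,
                "action_history": list(history),
            })
            # Mixed strategy: mostly heuristic, sometimes random, so the dataset
            # also contains suboptimal meter states the model must recover from.
            if random.random() < heuristic_prob:
                action = heuristic_action(obs)
            else:
                action = random_action()
            obs, reward, done, info = env.step(action)
            history.append(action)
            step_index += 1
    return samples
```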
Baselines
Established before training on 5 episodes × 3 profiles:
| Strategy | Introvert | Extrovert | Workaholic |
|---|---|---|---|
| Random | ~0.65 | ~0.70 | ~0.65 |
| Heuristic | ~0.78 | ~0.76 | ~0.82 |
The heuristic baseline uses observable rules only (sleep when vitality is low, meditate when serenity is low, socialise when connection drops). It cannot differentiate profiles.
A trained agent should beat the heuristic on at least 2 of 3 profiles, with qualitatively different action sequences per profile: the introvert's week should look nothing like the workaholic's.
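For contrast, the whole heuristic fits in a few lines; a sketch with assumed thresholds, an assumed observation layout, and a hypothetical default action, since the real rules live in the repo.

```python
def heuristic_action(obs):
    """Observable-rules baseline: react to low meters, ignore the hidden profile."""
    meters = obs["meters"]            # assumed observation layout
    if meters["vitality"] < 0.3:      # thresholds are assumptions
        return "sleep"
    if meters["serenity"] < 0.3:
        return "meditate"
    if meters["connection"] < 0.3:
        return "socialise"
    return "work"                     # hypothetical default when nothing is low
```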
How to run
Open training/RhythmEnv_GRPO_Training.ipynb in Colab with a T4 GPU runtime.
Run cells in order:
- Install dependencies
- Clone repo from HF Space
- Verify environment
- Run baseline evaluation (saves `baseline_results`)
- Generate dataset
- Load model (Qwen 2.5-3B + LoRA)
- Setup reward functions
- Configure training (`beta=0.01`, `max_completion_length=16`)
- Train (`trainer.train()`)
- Save model
- Generate training plots
- Evaluate trained model
- Generate comparison chart (`baseline_vs_trained.png`)
Expected training behaviour
Healthy run: `completion_length` stays at 3–16 tokens throughout, KL stays below 1.0, mean reward climbs from ~1.5 toward ~3.0 over 500 steps.
Warning signs: `completion_length` spiking above 50, `clipped_ratio` approaching 1.0, KL above 5.0. If any of these appear, the `beta=0.01` fix is not being applied.
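One way to catch these signs without eyeballing the logs is to scan `trainer.state.log_history`; the metric keys below (`kl`, `completions/mean_length`) are assumptions that vary by TRL version, so print one log entry first to confirm the names.

```python
# Flag the warning signs above from the trainer's logged metrics.
# Metric key names differ between TRL versions; adjust after inspecting
# one entry of trainer.state.log_history.
for entry in trainer.state.log_history:
    step = entry.get("step")
    kl = entry.get("kl")
    length = entry.get("completions/mean_length")
    if kl is not None and kl > 5.0:
        print(f"step {step}: KL {kl:.2f} above 5.0 -- is beta=0.01 applied?")
    if length is not None and length > 50:
        print(f"step {step}: completion length {length:.0f} spiking above 50")
```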
Output artifacts
After a successful run, download these and commit to the repo:
- `plots/training_loss.png` – loss curve across 500 steps
- `plots/reward_curve.png` – mean reward with ±1 std band
- `plots/baseline_vs_trained.png` – comparison bar chart (random / heuristic / trained)
- `plots/eval_results.json` – raw per-episode scores
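A few lines of matplotlib are enough for the comparison chart; the sketch below assumes the per-profile scores have already been aggregated into a dict (the baseline numbers come from the table above, the trained values are placeholders), which is not necessarily the raw layout of `eval_results.json`.

```python
import matplotlib.pyplot as plt

# Assumed input: mean episode reward per profile and strategy, aggregated
# from the per-episode scores in eval_results.json. Trained values are placeholders.
results = {
    "Introvert":  {"random": 0.65, "heuristic": 0.78, "trained": 0.0},
    "Extrovert":  {"random": 0.70, "heuristic": 0.76, "trained": 0.0},
    "Workaholic": {"random": 0.65, "heuristic": 0.82, "trained": 0.0},
}

profiles = list(results)
strategies = ["random", "heuristic", "trained"]
width = 0.25

fig, ax = plt.subplots()
for i, strategy in enumerate(strategies):
    scores = [results[p][strategy] for p in profiles]
    ax.bar([x + i * width for x in range(len(profiles))], scores, width, label=strategy)

ax.set_xticks([x + width for x in range(len(profiles))])
ax.set_xticklabels(profiles)
ax.set_ylabel("Mean episode reward")
ax.legend()
fig.savefig("plots/baseline_vs_trained.png", dpi=150)
```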