# PIPer Qwen3-8B RL Environment Setup
This model is Qwen3-8B fine-tuned with PPO (Proximal Policy Optimization) for automated environment setup tasks.
## Model Details
- Base Model: Qwen/Qwen3-8B
- Training Method: Reinforcement Learning (PPO)
- Dataset: JetBrains-Research/PIPer-envbench-zeroshot-rl (742 samples)
- Training Infrastructure: 8x H200 GPUs with FSDP
- Training Time: 1.75 hours (15 epochs)
- Checkpoint: global_step_15
## Performance

### EnvBench Evaluation (20-problem subset)
- Pass@5: 30.0% (6/20 problems)
- Validation Reward: 0.601
- Script Extraction Rate: 34%
### Comparison with Paper
- Paper (PIPer): 19.4% pass@5
- This Model: 30.0% pass@5
- Improvement: +10.6 percentage points (≈55% relative improvement)
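The pass@5 figures above are the fraction of problems solved by at least one of five sampled scripts. For reference, a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), which generalizes this to n samples per problem:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n total attempts (c of them correct) succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With exactly n = k = 5 attempts per problem, pass@5 reduces to the
# fraction of problems with at least one passing attempt, e.g. 6/20:
per_problem = [pass_at_k(5, c, 5) for c in [1, 0, 2, 0, 0, 1] + [0] * 14]
print(sum(per_problem) / len(per_problem))  # 0.3
```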
## Training Configuration
- Reward Function: Shellcheck-based rule reward
- Algorithm: Reinforce++ with tool masking
- Learning Rate: Cosine schedule (3e-6 peak)
- Batch Size: 128
- Max Sequence Length: 32,000 tokens
- FSDP: Fully Sharded Data Parallel training
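The shellcheck-based rule reward can be sketched as a script-level scorer: run `shellcheck` on the generated bash script and map the number of findings to a reward. The weighting below (0.1 penalty per finding, clamped to [0, 1]) is an illustrative assumption, not the exact reward used in training:

```python
import json
import shutil
import subprocess

def score_issues(issues: list) -> float:
    """Map a list of shellcheck findings to a reward in [0, 1]:
    a clean script scores 1.0, each finding costs 0.1 (assumed weights)."""
    return max(0.0, 1.0 - 0.1 * len(issues))

def shellcheck_reward(script: str) -> float:
    """Run shellcheck on a script read from stdin and score its JSON report.
    Requires the `shellcheck` binary on PATH."""
    proc = subprocess.run(
        ["shellcheck", "--format=json", "--shell=bash", "-"],
        input=script, capture_output=True, text=True,
    )
    try:
        issues = json.loads(proc.stdout or "[]")
    except json.JSONDecodeError:
        return 0.0  # shellcheck could not produce a report
    return score_issues(issues)

if shutil.which("shellcheck"):
    print(shellcheck_reward("#!/bin/bash\nset -euo pipefail\npip install -r requirements.txt\n"))
```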
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "rs545837/PIPer-Qwen3-8B-RL-envsetup",
    trust_remote_code=True,
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "rs545837/PIPer-Qwen3-8B-RL-envsetup",
    trust_remote_code=True,
)

# Example prompt
messages = [
    {"role": "user", "content": "Write a bash script to set up the environment for this Python project..."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
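The script extraction rate reported above refers to pulling a runnable bash script out of the model's free-form completion. A minimal extraction helper (the regex and fallback behavior are assumptions, not the evaluation harness's exact logic) might look like:

```python
import re

FENCE = "`" * 3  # markdown code-fence marker

def extract_bash_script(completion: str):
    """Return the contents of the first bash-fenced code block in a
    completion, or None if no such block is found."""
    pattern = FENCE + r"(?:bash|sh|shell)\n(.*?)" + FENCE
    match = re.search(pattern, completion, re.DOTALL)
    return match.group(1).strip() if match else None

sample = (
    "Here is the setup script:\n"
    + FENCE + "bash\npip install -r requirements.txt\n" + FENCE
    + "\nDone."
)
print(extract_bash_script(sample))  # pip install -r requirements.txt
```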
## Citation
Based on the PIPer paper:
```bibtex
@article{piper2025,
  title={PIPer: On-Device Environment Setup via Online Reinforcement Learning},
  year={2025}
}
```
## License
Apache 2.0