YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)
PIPer Stage 2 RL - Final Checkpoint
100% pass@5 on EnvBench evaluation set!
This model is the final checkpoint from a 2-stage training pipeline for Python environment setup tasks.
Model Description
- Base Model: Qwen3-8B-am
- Training Pipeline:
- Stage 1: Supervised Fine-Tuning on 2,250 ShareGPT conversations
- Stage 2: Reinforcement Learning with PPO on 228 EnvBench samples (40 epochs)
- Hardware: 8x NVIDIA H200 GPUs
- Training Time: ~3 hours total
Performance
| Metric | Value |
|---|---|
| pass@5 (20-sample eval) | 100% (20/20 problems) |
| Baseline (paper) | 19.4% |
| Baseline (reproduction) | 30% |
| Improvement | +70 percentage points |
Training Data
Stage 1: PIPer-SFT-ShareGPT-Data
- 2,250 training conversations
- 250 validation conversations
Stage 2: PIPer-EnvBench-Data
- 228 environment setup problems (training)
- 96 environment setup problems (test)
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"PIPer-Stage2-RL-Final",
trust_remote_code=True,
torch_dtype="bfloat16",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("PIPer-Stage2-RL-Final")
# Format prompt
messages = [{
"role": "user",
"content": "Your task is to generate a bash script that will set up a Python development environment..."
}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=4096, temperature=0.8, top_p=0.95)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Training Details
Stage 1 Configuration
- Dataset: ShareGPT conversations
- Batch Size: 256 (8 GPUs ร 32 samples per GPU)
- Learning Rate: 2e-5
- Epochs: 3
- Sequence Length: 4096
- Training Steps: 24
Stage 2 Configuration
- Algorithm: PPO (Proximal Policy Optimization)
- Dataset: EnvBench environment setup problems
- Batch Size: 128
- Reward Function: Strict shellcheck validation
- Epochs: 40
- Sequence Length: 8192
- Training Steps: 40
Evaluation Results
Evaluated on 20 problems from EnvBench test set with pass@5 metric (5 samples per problem):
- 20/20 problems passed (100% success rate)
- Most problems achieved 5/5 correct samples
- Strong consistency across samples
Sample Reward Distributions
- Problem 1: [1.00, 1.00, 1.00, 1.00, 1.00] โ
- Problem 2: [1.00, 1.00, 1.00, 1.00, 1.00] โ
- Problem 3: [1.00, 1.00, 1.00, 1.00, -1.00] โ
- ...
Architecture
- Framework: veRL (Versatile Reinforcement Learning)
- Distribution: FSDP (Fully Sharded Data Parallel)
- Inference: vLLM 0.8.4 (hybrid FSDP+vLLM mode)
- Attention: Flash Attention 2
Checkpoints
- Stage 1 SFT: PIPer-Stage1-SFT-ShareGPT
- Stage 2 RL (this model): PIPer-Stage2-RL-Final
Citation
Based on the PIPer paper:
@article{piper2025,
title={PIPer: Automated Python Environment Setup with Reinforcement Learning},
author={...},
journal={arXiv preprint},
year={2025}
}
License
Same as base model (Qwen3-8B-am)
Acknowledgments
- JetBrains Research for the PIPer codebase and EnvBench dataset
- Qwen team for the base model
- veRL team for the training framework
- Downloads last month
- 28
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐ Ask for provider support