PIPer Stage 2 RL - Final Checkpoint

100% pass@5 on EnvBench evaluation set!

This model is the final checkpoint from a 2-stage training pipeline for Python environment setup tasks.

Model Description

  • Base Model: Qwen3-8B-am
  • Training Pipeline:
    • Stage 1: Supervised Fine-Tuning on 2,250 ShareGPT conversations
    • Stage 2: Reinforcement Learning with PPO on 228 EnvBench samples (40 epochs)
  • Hardware: 8x NVIDIA H200 GPUs
  • Training Time: ~3 hours total

Performance

Metric                      Value
pass@5 (20 problems × 5)    100% (20/20 problems)
Baseline (paper)            19.4%
Baseline (reproduction)     30%
Improvement                 +70 percentage points over the reproduction baseline

Training Data

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "PIPer-Stage2-RL-Final",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("PIPer-Stage2-RL-Final")

# Format prompt
messages = [{
    "role": "user",
    "content": "Your task is to generate a bash script that will set up a Python development environment..."
}]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# Sampling parameters (temperature/top_p) take effect only with do_sample=True
outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.8, top_p=0.95)
# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Training Details

Stage 1 Configuration

  • Dataset: ShareGPT conversations
  • Batch Size: 256 (8 GPUs × 32 samples per GPU)
  • Learning Rate: 2e-5
  • Epochs: 3
  • Sequence Length: 4096
  • Training Steps: 24
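The reported step count is consistent with the other Stage 1 numbers, assuming incomplete batches are dropped (as with drop_last=True):

```python
# Sanity check of the Stage 1 step count (assumes incomplete batches are dropped)
samples, batch_size, epochs = 2250, 256, 3
steps_per_epoch = samples // batch_size  # 8 full batches of 256 per epoch
total_steps = steps_per_epoch * epochs
print(total_steps)  # 24, matching the reported value
```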

Stage 2 Configuration

  • Algorithm: PPO (Proximal Policy Optimization)
  • Dataset: EnvBench environment setup problems
  • Batch Size: 128
  • Reward Function: Strict shellcheck validation
  • Epochs: 40
  • Sequence Length: 8192
  • Training Steps: 40
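The "strict shellcheck validation" reward above could be sketched roughly as follows. This is a hypothetical reconstruction, not the actual PIPer reward code: the generated script is run through shellcheck, and any finding at all yields the failure reward.

```python
import json
import subprocess
import tempfile

def reward_from_findings(findings: list) -> float:
    # Strict variant: any shellcheck finding at all yields the failure reward.
    return 1.0 if not findings else -1.0

def shellcheck_reward(script: str) -> float:
    """Hypothetical reward: +1.0 if shellcheck reports no issues, else -1.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run(
        ["shellcheck", "--format=json", path],
        capture_output=True, text=True,
    )
    return reward_from_findings(json.loads(result.stdout or "[]"))
```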

Evaluation Results

Evaluated on 20 problems from the EnvBench test set with the pass@5 metric (5 samples per problem):

  • 20/20 problems passed (100% success rate)
  • Most problems achieved 5/5 correct samples
  • Strong consistency across samples

Sample Reward Distributions

  • Problem 1: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
  • Problem 2: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
  • Problem 3: [1.00, 1.00, 1.00, 1.00, -1.00] ✓
  • ...
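Under pass@5, a problem counts as solved if at least one of its 5 samples succeeds, which is why Problem 3 above still passes despite one failed sample. A minimal sketch of the metric (assuming a reward of 1.0 marks a passing sample):

```python
def pass_at_k(reward_lists, threshold=1.0):
    """Fraction of problems with at least one sample meeting the pass threshold."""
    solved = sum(any(r >= threshold for r in rewards) for rewards in reward_lists)
    return solved / len(reward_lists)

rewards = [
    [1.00, 1.00, 1.00, 1.00, 1.00],
    [1.00, 1.00, 1.00, 1.00, 1.00],
    [1.00, 1.00, 1.00, 1.00, -1.00],  # still passes: one success suffices
]
print(pass_at_k(rewards))  # 1.0
```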

Architecture

  • Framework: veRL (Versatile Reinforcement Learning)
  • Distribution: FSDP (Fully Sharded Data Parallel)
  • Inference: vLLM 0.8.4 (hybrid FSDP+vLLM mode)
  • Attention: Flash Attention 2

Checkpoints

Citation

Based on the PIPer paper:

@article{piper2025,
  title={PIPer: Automated Python Environment Setup with Reinforcement Learning},
  author={...},
  journal={arXiv preprint},
  year={2025}
}

License

Same as base model (Qwen3-8B-am)

Acknowledgments

  • JetBrains Research for the PIPer codebase and EnvBench dataset
  • Qwen team for the base model
  • veRL team for the training framework