PIPer Qwen3-8B RL Environment Setup

This model is Qwen3-8B fine-tuned with online reinforcement learning (Reinforce++, a PPO-style policy-gradient algorithm) for automated environment setup tasks.

Model Details

  • Base Model: Qwen/Qwen3-8B
  • Training Method: Reinforcement Learning (Reinforce++)
  • Dataset: JetBrains-Research/PIPer-envbench-zeroshot-rl (742 samples)
  • Training Infrastructure: 8x H200 GPUs with FSDP
  • Training Time: 1.75 hours (15 epochs)
  • Checkpoint: global_step_15

Performance

EnvBench Evaluation (20-problem subset)

  • Pass@5: 30.0% (6/20 problems)
  • Validation Reward: 0.601
  • Script Extraction Rate: 34%

Comparison with Paper

  • Paper (PIPer): 19.4% pass@5
  • This Model: 30.0% pass@5
  • Improvement: +10.6 percentage points (≈55% relative improvement)
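Pass@5 above counts a problem as solved if any of 5 sampled scripts passes. The standard unbiased pass@k estimator (function and variable names below are illustrative, not the evaluation harness's code) generalizes this to n samples per problem:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k),
    where n = samples drawn per problem and c = samples that passed."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 5, pass@5 is simply "did any sample pass?":
print(pass_at_k(5, 0, 5))  # 0.0
print(pass_at_k(5, 2, 5))  # 1.0
```

When n equals k the estimator reduces to the any-sample-passed rule used for the 30.0% figure above.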

Training Configuration

  • Reward Function: Shellcheck-based rule reward
  • Algorithm: Reinforce++ with tool masking
  • Learning Rate: Cosine schedule (3e-6 peak)
  • Batch Size: 128
  • Max Sequence Length: 32,000 tokens
  • FSDP: Fully Sharded Data Parallel training
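The shellcheck-based rule reward is not specified in detail here; a minimal sketch of one plausible shape (the severity weights and function name are assumptions, not the training code) parses `shellcheck --format=json` output and maps finding severities to a score in [0, 1]:

```python
import json

# Assumed penalty per shellcheck severity level (illustrative values only).
PENALTY = {"error": 0.5, "warning": 0.2, "info": 0.05, "style": 0.02}

def rule_reward(shellcheck_json: str) -> float:
    """Map the JSON findings list from `shellcheck --format=json`
    on a generated script to a reward clamped to [0, 1]."""
    findings = json.loads(shellcheck_json)
    penalty = sum(PENALTY.get(f.get("level", "style"), 0.0) for f in findings)
    return max(0.0, 1.0 - penalty)

print(rule_reward("[]"))  # clean script: full reward of 1.0
print(rule_reward(json.dumps([{"level": "warning"}, {"level": "error"}])))
```

A scheme like this gives dense feedback without executing the script, which fits the fast rule-based reward described above.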

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in bfloat16; device_map="auto" places it on available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "rs545837/PIPer-Qwen3-8B-RL-envsetup",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "rs545837/PIPer-Qwen3-8B-RL-envsetup",
    trust_remote_code=True,
)

# Example prompt
messages = [
    {"role": "user", "content": "Write a bash script to set up the environment for this Python project..."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Citation

Based on the PIPer paper:

@article{piper2025,
  title={PIPer: On-Device Environment Setup via Online Reinforcement Learning},
  year={2025}
}

License

Apache 2.0
