PIPer Qwen3-8B RL Environment Setup

This model is Qwen3-8B fine-tuned with online reinforcement learning (Reinforce++, a PPO-style policy-gradient algorithm) for automated environment setup tasks.

Model Details

  • Base Model: Qwen/Qwen3-8B
  • Training Method: Reinforcement Learning (Reinforce++)
  • Dataset: JetBrains-Research/PIPer-envbench-zeroshot-rl (742 samples)
  • Training Infrastructure: 8x H200 GPUs with FSDP
  • Training Time: 1.75 hours (15 epochs)
  • Checkpoint: global_step_15

Performance

EnvBench Evaluation (20-problem subset)

  • Pass@5: 30.0% (6/20 problems)
  • Validation Reward: 0.601
  • Script Extraction Rate: 34%

Comparison with Paper

  • Paper (PIPer): 19.4% pass@5
  • This Model: 30.0% pass@5
  • Improvement: +10.6 percentage points (≈55% relative improvement)
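Pass@5 above counts a problem as solved if any of 5 sampled scripts passes. The standard unbiased pass@k estimator (function and variable names below are illustrative, not the evaluation harness's code) generalizes this to n samples per problem:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k),
    where n = samples drawn per problem and c = samples that passed."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 5, pass@5 is simply "did any sample pass?":
print(pass_at_k(5, 0, 5))  # 0.0
print(pass_at_k(5, 2, 5))  # 1.0
```

When n equals k the estimator reduces to the any-sample-passed rule used for the 30.0% figure above.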

Training Configuration

  • Reward Function: Shellcheck-based rule reward
  • Algorithm: Reinforce++ with tool masking
  • Learning Rate: Cosine schedule (3e-6 peak)
  • Batch Size: 128
  • Max Sequence Length: 32,000 tokens
  • FSDP: Fully Sharded Data Parallel training
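The shellcheck-based rule reward is not specified in detail here; a minimal sketch of one plausible shape (the severity weights and function name are assumptions, not the training code) parses `shellcheck --format=json` output and maps finding severities to a score in [0, 1]:

```python
import json

# Assumed penalty per shellcheck severity level (illustrative values only).
PENALTY = {"error": 0.5, "warning": 0.2, "info": 0.05, "style": 0.02}

def rule_reward(shellcheck_json: str) -> float:
    """Map the JSON findings list from `shellcheck --format=json`
    on a generated script to a reward clamped to [0, 1]."""
    findings = json.loads(shellcheck_json)
    penalty = sum(PENALTY.get(f.get("level", "style"), 0.0) for f in findings)
    return max(0.0, 1.0 - penalty)

print(rule_reward("[]"))  # clean script: full reward of 1.0
print(rule_reward(json.dumps([{"level": "warning"}, {"level": "error"}])))
```

A scheme like this gives dense feedback without executing the script, which fits the fast rule-based reward described above.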

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in bfloat16; device_map="auto" places it on available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "rs545837/PIPer-Qwen3-8B-RL-envsetup",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "rs545837/PIPer-Qwen3-8B-RL-envsetup",
    trust_remote_code=True,
)

# Example prompt
messages = [
    {"role": "user", "content": "Write a bash script to set up the environment for this Python project..."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Citation

Based on the PIPer paper:

@article{piper2025,
  title={PIPer: On-Device Environment Setup via Online Reinforcement Learning},
  year={2025}
}

License

Apache 2.0
