FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
Abstract
FP8 quantization for LLM reinforcement learning targets the rollout bottleneck: blockwise W8A8 linear layers and an FP8 KV cache cut compute and memory cost, while importance-sampling-based correction mitigates train-inference mismatch.
Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequences make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine), and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to the KV cache via per-step QKV scale recalibration to remove long-context memory bottlenecks, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
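To make the rollout-side quantization concrete, here is a minimal sketch of blockwise FP8 weight quantization in PyTorch. It is illustrative rather than the report's actual implementation: the 128×128 tile size, the e4m3 format, and the helper name `quantize_blockwise_fp8` are assumptions, and production stacks typically fuse this with the FP8 GEMM and quantize activations per-group as well.

```python
import torch

# Requires a PyTorch build with float8 support (>= 2.1).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_blockwise_fp8(weight: torch.Tensor, block_size: int = 128):
    """Quantize a 2-D weight into FP8 (e4m3) with one FP32 scale per
    (block_size x block_size) tile. Hypothetical helper for illustration."""
    out_f, in_f = weight.shape
    # Zero-pad so both dimensions are multiples of the block size.
    pad_r, pad_c = (-out_f) % block_size, (-in_f) % block_size
    w = torch.nn.functional.pad(weight.float(), (0, pad_c, 0, pad_r))
    rows, cols = w.shape
    # View as (row_blocks, block, col_blocks, block) and take per-tile amax.
    tiles = w.view(rows // block_size, block_size, cols // block_size, block_size)
    amax = tiles.abs().amax(dim=(1, 3)).clamp(min=1e-12)  # (row_blocks, col_blocks)
    scale = amax / FP8_MAX                                 # dequantization scale per tile
    q = (tiles / scale[:, None, :, None]).clamp(-FP8_MAX, FP8_MAX)
    q_fp8 = q.view(rows, cols)[:out_f, :in_f].to(torch.float8_e4m3fn)
    return q_fp8, scale  # the FP8 GEMM consumes q_fp8 plus the per-tile scales
```

Because the policy weights change every RL step, this quantization (and the weight sync into the inference engine) has to be repeated after each update, which is part of the engineering cost the abstract refers to.

The mismatch mitigation can likewise be sketched as a token-level truncated importance sampling (TIS) weight applied to the policy-gradient loss. The loss form, the clipping constant `clip_c`, and the function name are illustrative assumptions, not the exact TIS/MIS variants used in the report.

```python
import torch

def tis_corrected_pg_loss(trainer_logprobs: torch.Tensor,
                          rollout_logprobs: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_c: float = 2.0) -> torch.Tensor:
    """Token-level truncated importance sampling (TIS) correction: weight a
    REINFORCE-style surrogate by the clipped ratio between the trainer policy
    and the FP8 rollout policy, so the low-precision rollout distribution
    does not bias the update. Illustrative sketch only."""
    # pi_train(a_t|s_t) / pi_rollout(a_t|s_t), per token, used as a constant weight.
    ratio = torch.exp(trainer_logprobs - rollout_logprobs).detach()
    ratio = ratio.clamp(max=clip_c)  # truncate large ratios for stability
    # Gradient flows only through trainer_logprobs; advantages are precomputed.
    return -(ratio * advantages * trainer_logprobs).mean()
```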
Community
We have developed FP8 rollout features for two frameworks, verl and NeMo-RL. In this report, we introduce the implementation and a series of validation experiments (covering both dense and MoE models), with analyses from the perspectives of precision alignment and performance.
Going forward, we will continue to explore and refine stable FP8 end-to-end training recipes for MoE models. You are welcome to try these features and provide feedback.
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow (2026)
- Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts (2026)
- RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs (2025)
- SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache (2026)
- Each Prompt Matters: Scaling Reinforcement Learning Without Wasting Rollouts on Hundred-Billion-Scale MoE (2025)
- Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning (2025)
- RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure (2025)