Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities
Abstract
ProGRPO, a reinforcement learning approach built on an Advantage Re-weighting Mechanism (ARM), addresses entropy collapse in LLM reasoning by equilibrating confidence levels across correct responses through dynamic reward shaping.
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths while redistributing probability mass toward under-explored correct solutions. The resulting algorithm, ProGRPO, significantly enhances generative diversity and response entropy while maintaining competitive accuracy, achieving a superior trade-off between exploration and exploitation in reasoning tasks. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.
Community
When reinforcement learning (RL) methods such as GRPO are used to enhance the reasoning capabilities of large language models, the models often suffer from mode collapse and entropy collapse. Specifically, the model gradually becomes overconfident and tends to repeatedly generate a single high-reward reasoning trajectory while ceasing to explore other potentially correct solution paths. This behavior significantly reduces the diversity of generated outputs, thereby limiting performance on multi-sample evaluation metrics such as Pass@K.
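For reference, Pass@K is commonly computed with the standard unbiased estimator over n sampled solutions of which c are verified correct (whether the paper uses exactly this estimator is an assumption; the numbers in the example are hypothetical):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k samples
    drawn without replacement from n generations is correct, given that c of
    the n generations are verified correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 32 rollouts per prompt, 4 verified correct
print(pass_at_k(n=32, c=4, k=1))  # 0.125 (= 4/32)
print(pass_at_k(n=32, c=4, k=8))  # ~0.70: diverse correct rollouts pay off at larger K
```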
To address this issue, we propose ProGRPO from a probabilistic modeling perspective. By assigning more balanced probability mass to all feasible reasoning trajectories, our method effectively alleviates entropy collapse. As a result, ProGRPO improves performance under multi-sample settings (Pass@K) while maintaining strong single-sample reasoning performance.
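As a rough illustration only, the sketch below shows one way a confidence-balanced re-weighting could plug into a GRPO-style group advantage, attenuating the positive advantages of rollouts the policy is already very confident about. The function name, the `alpha` knob, and the exact form of the weight are assumptions for illustration; the paper's actual ARM formula is not reproduced here.

```python
import math
import torch

def arm_reweighted_advantages(
    rewards: torch.Tensor,       # [G] verifiable rewards for a group of G rollouts
    seq_logprobs: torch.Tensor,  # [G] summed token log-probs of each rollout under the policy
    seq_lengths: torch.Tensor,   # [G] rollout lengths in tokens
    prompt_ppl: float,           # perplexity of the prompt under the policy
    alpha: float = 0.5,          # strength of the attenuation (hypothetical knob)
) -> torch.Tensor:
    """Hypothetical confidence-balanced re-weighting on top of the standard
    GRPO group-relative advantage (r - mean) / (std + eps)."""
    eps = 1e-6
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Answer confidence: geometric-mean per-token probability of each rollout.
    answer_conf = torch.exp(seq_logprobs / seq_lengths.clamp(min=1))

    # Illustrative weighting: the more confident the policy already is in a
    # correct rollout (and the lower the prompt's perplexity), the more its
    # positive advantage is attenuated.
    weight = 1.0 - alpha * answer_conf / (1.0 + math.log1p(prompt_ppl))
    weight = weight.clamp(min=0.1, max=1.0)

    # Attenuate only positive advantages (already-reinforced correct paths);
    # negative advantages are left untouched.
    return torch.where(adv > 0, adv * weight, adv)
```

In such a setup, the re-weighted advantages would simply replace the standard group-relative advantages in the usual GRPO policy-gradient loss.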
Cross-domain validation. The proposed method demonstrates consistent effectiveness not only on mathematical reasoning tasks (AIME, MATH) but also on code generation tasks (LiveCodeBench), providing strong evidence for the robustness and generality of the approach.
Low-Probability Token Normalization. The sentence-level average probability can obscure the uncertainty of critical reasoning steps, because in long sequence generation, the majority of tokens are typically predicted with very high confidence (often above 90%), which dominates the averaging process and masks the few low-probability but decision-critical tokens.
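A small numerical illustration (with made-up probabilities) of how the sequence-level mean hides a single low-probability, decision-critical token:

```python
import math

# Hypothetical per-token probabilities of a 12-token reasoning step:
# eleven routine tokens predicted with >=95% confidence and one
# decision-critical token (e.g. the chosen case split) at only 20%.
token_probs = [0.98, 0.97, 0.99, 0.95, 0.96, 0.20,
               0.98, 0.99, 0.97, 0.96, 0.98, 0.99]

arith_mean = sum(token_probs) / len(token_probs)
geo_mean = math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))
min_prob = min(token_probs)

print(f"arithmetic mean = {arith_mean:.2f}")  # ~0.91: the sequence looks confident
print(f"geometric mean  = {geo_mean:.2f}")    # ~0.85: still dominated by easy tokens
print(f"minimum         = {min_prob:.2f}")    # 0.20: the decision-critical token
```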