arxiv:2602.03143

Self-Hinting Language Models Enhance Reinforcement Learning

Published on Feb 3 · Submitted by Baohao Liao on Feb 5

Abstract

AI-generated summary: SAGE is an on-policy reinforcement learning framework that enhances GRPO by injecting self-hints during training to increase outcome diversity under sparse rewards, improving alignment of large language models.

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x, h). Crucially, the task reward R(x, τ) is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h = ∅ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
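
To make the collapse concrete: GRPO standardizes each rollout's reward within its group, so when every rollout receives the same terminal reward, all advantages are exactly zero and the update vanishes. The snippet below is a minimal numeric illustration (not the paper's implementation; the function name and reward values are made up for exposition) of that failure mode and of how hint-induced outcome diversity restores a usable signal.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Sparse-reward failure mode: every rollout for a hard prompt fails the verifier,
# so all group members get identical rewards and every advantage is exactly zero.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0. 0. 0. 0.] -> no gradient

# With self-hints, some hint-conditioned rollouts now pass the *same* verifier,
# so rewards differ within the group and the advantages become informative.
print(group_relative_advantages([0.0, 1.0, 0.0, 1.0]))  # approx. [-1.  1. -1.  1.]
```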

Community

Paper submitter

RL for LLMs often stalls under sparse rewards — especially with GRPO, where whole rollout groups get identical 0 rewards and learning just… dies.

💡 SAGE fixes this with a simple but powerful idea:
👉 Let the model give itself hints during training.

How it works (rough code sketch after the list):

  • The model samples a compact hint (plan / decomposition) before solving
  • Rewards stay unchanged (same verifier, same objective)
  • Hints only reshape sampling, preventing advantage collapse
  • At test time? No hints at all. Clean deployment.
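
A rough sketch of the train-vs-deploy difference is below. The prompt templates and the model.generate interface are assumptions made for illustration, not the repo's actual API.

```python
# Illustrative rollout helpers -- prompt wording and the model.generate()
# interface are assumptions, not SAGE's actual code.

def build_prompt(x, hint=None):
    """Training conditions the solution on (x, h); at test time hint is None (h = ∅)."""
    if hint is None:
        return f"Problem: {x}\nSolve step by step."
    return f"Problem: {x}\nSelf-hint (plan): {hint}\nSolve step by step."

def rollout(model, x, use_self_hint):
    # 1) Optionally sample a compact self-hint (plan / decomposition).
    hint = model.generate(f"Write a short plan for solving: {x}") if use_self_hint else None
    # 2) Generate a solution conditioned on (x, h).
    solution = model.generate(build_prompt(x, hint))
    # 3) The verifier scores only (x, solution); the hint never changes the reward.
    return solution, hint

# Training:   rollout(model, x, use_self_hint=True)   -> diverse group outcomes
# Deployment: rollout(model, x, use_self_hint=False)  -> clean, hint-free policy
```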

🔥 Why it matters:

  • Turns dead-end prompts into useful learning signals
  • Acts as an adaptive curriculum driven by the model itself
  • Stays fully on-policy (no external teachers required)

📊 Results across 6 benchmarks & 3 LLMs (average gains over GRPO):

  • +2.0 on Llama-3.2-3B
  • +1.2 on Qwen2.5-7B
  • +1.3 on Qwen3-4B

Sometimes the best teacher is… yourself 😌

Code: https://github.com/BaohaoLiao/SAGE

