Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments on 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, with average gains of +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
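The advantage-collapse mechanism described above can be illustrated with a minimal sketch (not the authors' code): GRPO normalizes each rollout's reward against the group mean and standard deviation, so a group whose rollouts all receive the same sparse terminal reward yields zero advantages and hence no gradient signal, while hint-induced outcome diversity restores a nonzero signal.

```python
def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Sparse terminal reward: every rollout in the group fails identically.
uniform = [0.0, 0.0, 0.0, 0.0]
# Hint-induced outcome diversity: some rollouts now succeed.
diverse = [1.0, 0.0, 0.0, 1.0]

print(group_advantages(uniform))  # all zeros: the policy update vanishes
print(group_advantages(diverse))  # nonzero advantages: a usable learning signal
```

The specific reward values are illustrative; the point is that any all-identical group (all 0s or all 1s) produces zero advantages, which is exactly the failure mode SAGE's self-hints are designed to avoid.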