Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this question by formalizing over-sharpening, a phenomenon in which the policy collapses onto a limited set of modes, suppressing valid alternatives. At a high level, we find that finite-batch updates intrinsically bias learning toward the sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration, which prioritizes difficult queries, and distribution-level calibration, which diversifies sampling via a memory network. Empirical evaluations confirm that our strategies effectively improve generalization.
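The abstract names inverse-success advantage calibration but does not spell it out; the following is a minimal sketch of one plausible reading, in which a GRPO-style group-relative advantage is reweighted by the inverse of each query's empirical success rate. The function name inverse_success_advantages, the exponent alpha, and the exact normalization are illustrative assumptions, not the paper's definition; the distribution-level memory-network calibration is omitted because the abstract gives too little detail to sketch faithfully.

```python
import numpy as np

def inverse_success_advantages(rewards, eps=1e-6, alpha=1.0):
    """Hypothetical sketch of inverse-success advantage calibration.

    rewards: array of shape (n_queries, n_samples), holding verifiable
    rewards (1 = correct, 0 = incorrect) for each sampled rollout.
    Returns calibrated advantages of the same shape.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Per-query empirical success rate over the sampled group.
    success = rewards.mean(axis=1, keepdims=True)
    # GRPO-style group-relative advantage: center and scale rewards
    # within each query's group of rollouts.
    adv = (rewards - success) / (rewards.std(axis=1, keepdims=True) + eps)
    # Inverse-success weight (assumed form): queries the policy rarely
    # solves get larger updates, counteracting collapse onto easy modes.
    weight = 1.0 / (success + eps) ** alpha
    return adv * weight

# Example: the lone success on a hard query (1/4 solved) receives a
# larger calibrated advantage than a success on an easy query (3/4).
print(inverse_success_advantages([[1, 0, 0, 0]]))
print(inverse_success_advantages([[1, 1, 1, 0]]))
```

The inverse weight grows as the per-query success rate shrinks, so rare correct rollouts on hard queries receive proportionally larger updates, which is one way such a calibration could counteract collapse onto already-mastered modes.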