This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident, deterministic outputs. Together these observations highlight a puzzling dynamic: discouraging exploitation and discouraging exploration both improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield genuine gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, yielding more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model that explains why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
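For concreteness, here is a minimal sketch of the two quantities the abstract turns on, assuming a standard PPO-style clipped surrogate objective (the paper's actual RLVR algorithm, e.g. a GRPO-style variant, may differ in detail). The asymmetry introduced by the min/clip operation is the usual source of the clipping bias discussed above:

$$
\mathcal{L}_{\mathrm{clip}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

and the policy entropy of question (i) is the expected token-level Shannon entropy of the policy,

$$
\mathcal{H}(\pi_\theta)
  = \mathbb{E}_{s}\!\left[-\sum_{a} \pi_\theta(a \mid s)\,\log \pi_\theta(a \mid s)\right],
$$

so "entropy minimization" and "more confident, deterministic outputs" both correspond to driving $\mathcal{H}(\pi_\theta)$ down.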