Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV collapses the rollout distribution into a single outcome, discarding information about non-majority but correct candidate actions, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus for non-majority rollouts and a distribution pruning mechanism for reward denoising, yielding a more informative and robust reward estimate. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on AIME 2024 and 5.3% on AMC.
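To make the contrast concrete, the sketch below compares a majority-voting reward with a distribution-aware estimate over rollout answers. It is a minimal illustration, not the paper's implementation: the function names, the pruning threshold, and the exploration-bonus value are hypothetical placeholders chosen only to show how non-majority answers retain credit under a distribution-based reward.

```python
# Minimal sketch (assumed names/values, not the paper's code): majority-voting
# reward vs. a distribution-aware reward over rollout answers.
from collections import Counter

def mv_reward(rollout_answers, candidate):
    """Majority-voting reward: 1 if the candidate matches the single most
    frequent rollout answer, else 0 (all other probability mass is discarded)."""
    majority, _ = Counter(rollout_answers).most_common(1)[0]
    return 1.0 if candidate == majority else 0.0

def distribution_reward(rollout_answers, candidate, prune_below=0.15, explore_bonus=0.05):
    """Distribution-aware reward: score a candidate by its empirical frequency
    among rollouts, prune low-frequency (likely noisy) answers, and add a small
    bonus to non-majority answers to encourage exploration.
    The threshold and bonus here are illustrative placeholders."""
    counts = Counter(rollout_answers)
    total = sum(counts.values())
    probs = {a: c / total for a, c in counts.items()}
    # Distribution pruning: drop answers whose empirical mass is too small.
    probs = {a: p for a, p in probs.items() if p >= prune_below}
    if not probs:
        return 0.0
    majority = max(probs, key=probs.get)
    reward = probs.get(candidate, 0.0)
    # Exploration bonus for non-majority answers that survive pruning.
    if candidate in probs and candidate != majority:
        reward += explore_bonus
    return reward

answers = ["42", "42", "41", "42", "7", "41", "41", "42"]
print(mv_reward(answers, "41"))            # 0.0: non-majority answer gets no credit
print(distribution_reward(answers, "41"))  # 0.425: frequency-based credit plus bonus
```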