Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards derived from majority voting. However, a spurious yet high-frequency consensus that has never been verified can become a biased reward signal that training then reinforces, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses evidence from an external tool (e.g., code execution) to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
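As an illustration only, verification-aware voting can be sketched as a weighted majority vote in which rollouts supported by a tool check count for more than unverified ones. The function names, the `verified_weight` factor, and the binary match-the-pseudo-label reward below are assumptions made for this sketch; the abstract does not specify the weighting scheme or the reward rule.

```python
from collections import defaultdict


def weighted_vote(answers, verified, verified_weight=3.0):
    """Pick a pseudo-label by weighted majority vote.

    answers: final answer extracted from each rollout.
    verified: True where a tool check (e.g. code execution) supported the rollout.
    verified_weight: hypothetical upweighting factor, not taken from the paper.
    """
    if not answers or len(answers) != len(verified):
        raise ValueError("answers and verified must be non-empty and equal length")
    scores = defaultdict(float)
    for answer, is_verified in zip(answers, verified):
        # Unverified rollouts count once; verified rollouts count more.
        scores[answer] += verified_weight if is_verified else 1.0
    return max(scores, key=scores.get)


def binary_rewards(answers, pseudo_label):
    """Assumed reward rule: 1.0 if a rollout agrees with the pseudo-label, else 0.0."""
    return [1.0 if answer == pseudo_label else 0.0 for answer in answers]


# Toy case: an unverified answer holds the plain majority,
# while the verified minority answer wins once it is upweighted.
answers = ["12", "12", "12", "15", "15"]
verified = [False, False, False, True, True]

plain_label = weighted_vote(answers, verified, verified_weight=1.0)
aware_label = weighted_vote(answers, verified)
aware_rewards = binary_rewards(answers, aware_label)
```

With `verified_weight=1.0` the function reduces to plain majority voting, which makes it easy to compare the two pseudo-labels on the same set of rollouts. In the toy case the plain vote selects the unverified consensus, while the weighted vote selects the answer backed by tool evidence.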