Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final-answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. Accordingly, we treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), which operationalizes robustness through a transfer reward that tests whether a partial reasoning prefix from one model can guide a separate model to the correct answer. This encourages LLMs to produce reasoning that is stable, interpretable, and genuinely generalizable. Our approach improves both sampling consistency and final-answer accuracy, and it reaches comparable performance in substantially fewer training steps. For example, on MATH500, RLTR achieves a +3.6 percentage-point gain in Maj@64 over RLVR and matches RLVR's average accuracy with roughly 2.5x fewer training steps, yielding both more reliable reasoning and significantly better sample efficiency.
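To make the transfer reward concrete, the following is a minimal sketch of how such a reward could be computed, assuming hypothetical helpers (`truncate_reasoning`, `check_answer`) and a generic `reference_model.generate` interface; these names are illustrative and not taken from the paper.

```python
def transfer_reward(problem, solution, answer_key,
                    reference_model, prefix_ratio=0.5):
    """Sketch of a transfer reward: reward a solution only if its truncated
    reasoning prefix lets a separate model reach the correct final answer.
    (Hypothetical helper names; not the paper's exact implementation.)"""
    # Keep only the leading portion of the policy model's reasoning trace.
    prefix = truncate_reasoning(solution, ratio=prefix_ratio)

    # Ask an independent model to continue reasoning from that prefix.
    continuation = reference_model.generate(problem + "\n" + prefix)

    # Verifiable, binary reward: 1 if the continued solution is correct.
    return 1.0 if check_answer(continuation, answer_key) else 0.0
```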