Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether a reasoning trace is faithful, reliable, or useful to the model that consumes it. An outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states through multi-step systems. To address this, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by the measured uplift on the same frozen executor, so that credit goes only to traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing one high-quality reference trace and multiple plausible flawed traces, each with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it.
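
The multiplicative reward described above can be illustrated with a minimal sketch. The abstract states only that an RM score is multiplied by a measured uplift on the frozen executor. Everything else here is an assumption made for illustration: the `TraceEval` container, the field names, the [0, 1] score range, the definition of uplift as the difference in verifier pass rate with and without the trace, and the clipping of negative uplift to zero.

```python
from dataclasses import dataclass


@dataclass
class TraceEval:
    """Hypothetical per-trace measurements (names are illustrative only)."""
    rm_score: float            # rubric-based reasoning RM score, assumed in [0, 1]
    pass_with_trace: float     # frozen executor's verifier pass rate given the trace
    pass_without_trace: float  # same frozen executor's pass rate without the trace


def uplift(ev: TraceEval) -> float:
    # Assumed definition: gain in pass rate attributable to the trace,
    # clipped at zero so a harmful trace earns no credit.
    return max(0.0, ev.pass_with_trace - ev.pass_without_trace)


def executor_grounded_reward(ev: TraceEval) -> float:
    # Product form: the reward is zero if either factor is zero, so a trace
    # must both score well on the rubric and help the executor.
    return ev.rm_score * uplift(ev)


if __name__ == "__main__":
    helpful = TraceEval(rm_score=0.8, pass_with_trace=0.75, pass_without_trace=0.25)
    polished_but_useless = TraceEval(rm_score=0.9, pass_with_trace=0.5, pass_without_trace=0.5)
    print(executor_grounded_reward(helpful))
    print(executor_grounded_reward(polished_but_useless))
```

Under these assumptions, a trace that reads well but leaves the executor's pass rate unchanged receives no reward, which is the behavior the product form is meant to capture. An additive combination would still credit such a trace.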
