Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet it remains challenging to ensure that a reasoning trace both contributes to the model's final answer and faithfully reflects the process behind it, rather than merely accompanying it. We introduce AtManRL, a method that uses differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies the CoT tokens crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and supports the training of more transparent reasoning models.
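To make the ingredients named above concrete, the following is a minimal sketch, not the authors' implementation. It illustrates (1) an additive mask applied to attention scores before the softmax, (2) a combination of an outcome reward with a saliency reward, and (3) GRPO-style normalization of rewards within a sampled group. The function names, the weighted-sum form of the combined reward, the `weight` parameter, and all numeric values are assumptions made for illustration; the abstract does not specify how the saliency reward is computed or weighted.

```python
import numpy as np

def masked_attention_weights(scores, additive_mask):
    """Softmax over attention scores after adding a mask in logit space.

    A strongly negative mask entry suppresses attention to that token;
    a zero entry leaves the token untouched.
    """
    z = scores + additive_mask
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

def combined_reward(outcome, saliency, weight=0.5):
    """Hypothetical combination: outcome reward plus a weighted saliency reward."""
    outcome = np.asarray(outcome, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    return outcome + weight * saliency

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Suppress the second of three tokens with a large negative mask entry.
scores = np.array([[2.0, 1.0, 0.5]])
mask = np.array([[0.0, -1e9, 0.0]])
weights = masked_attention_weights(scores, mask)

# Four sampled completions for one prompt (illustrative values only).
rewards = combined_reward(outcome=[1, 0, 0, 1], saliency=[0.2, 0.8, 0.8, 0.2])
advantages = group_relative_advantages(rewards)
```

In this sketch the mask is fixed by hand; in a differentiable setting it would be a trainable parameter optimized by gradient descent, which is what allows influential tokens to be identified rather than chosen in advance.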