Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization, where the number of reasoning steps required exceeds that seen in the training distribution while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains and find that errors are not uniformly distributed but concentrate at the token positions of a few critical error types. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, which we term erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing an individual ep head during inference can often restore the correct prediction. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads during the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
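The single-head removal described above can be illustrated with a toy sketch. It assumes the common simplification that the next-token logits are a sum of per-head contributions, so deactivating a head amounts to dropping its term from the sum. The function names (`combine_heads`, `heads_restoring`), the `(layer, head)` indexing, and the numbers are all hypothetical and chosen for illustration. This is not the paper's procedure: the abstract does not specify how ep heads are identified dynamically, and the brute-force search here needs to know the correct token in advance.

```python
from typing import AbstractSet, Dict, List, Tuple

Head = Tuple[int, int]  # hypothetical (layer index, head index) identifier


def combine_heads(
    contributions: Dict[Head, List[float]],
    disabled: AbstractSet[Head] = frozenset(),
) -> List[float]:
    """Sum per-head logit contributions, skipping any disabled head."""
    size = len(next(iter(contributions.values())))
    logits = [0.0] * size
    for head, vec in contributions.items():
        if head in disabled:
            continue  # a deactivated head contributes nothing
        for i, value in enumerate(vec):
            logits[i] += value
    return logits


def argmax(values: List[float]) -> int:
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


def heads_restoring(
    contributions: Dict[Head, List[float]], correct: int
) -> List[Head]:
    """Heads whose individual removal makes `correct` the top prediction."""
    return [
        head
        for head in contributions
        if argmax(combine_heads(contributions, {head})) == correct
    ]


# Toy numbers (assumed): candidate 0 is the correct token, candidate 1 is not.
contributions: Dict[Head, List[float]] = {
    (0, 0): [2.0, 1.0],   # favours the correct token
    (0, 1): [1.0, 0.5],   # favours the correct token
    (1, 0): [-1.0, 3.0],  # suppresses the correct token, amplifies the wrong one
}

baseline = argmax(combine_heads(contributions))        # 1: wrong token wins
candidates = heads_restoring(contributions, correct=0)  # [(1, 0)]
```

In this toy setting, head `(1, 0)` plays the role of an ep head: with it active the incorrect candidate wins, and removing it alone flips the prediction back to the correct one.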