The reasoning patterns of large language models (LLMs) remain opaque, and reinforcement learning (RL) typically assigns uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads by whether they perform locally or globally focused information processing, and we show that locally focused heads produce a sawtooth pattern near the diagonal that marks phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these observations with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; and 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, in which the model first performs a long-range contextual reference to generate an introductory token, immediately followed by (or coinciding with) a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across diverse reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we turn opaque optimization into an actionable, structure-aware process, offering a potential step toward more transparent and effective optimization of LLM reasoning.
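The two metrics above admit a straightforward reading in code. The sketch below is a minimal illustration, not the paper's reference implementation: the exact window size, normalization, and edge handling are assumptions made for clarity, given a causal per-head attention matrix `A` where `A[t, j]` is the attention query position `t` pays to key position `j <= t`.

```python
import numpy as np

def windowed_avg_attention_distance(A, window=8):
    """Average backward attention distance within a clipped window.

    A: (T, T) causal attention matrix for one head, rows summing to 1.
    Small values indicate a locally focused head (near-diagonal mass);
    larger values indicate longer-range reference within the window.
    """
    T = A.shape[0]
    dists = []
    for t in range(1, T):
        lo = max(0, t - window)
        w = A[t, lo:t + 1]             # attention mass inside the window
        d = t - np.arange(lo, t + 1)   # backward distances window..0
        mass = w.sum()
        if mass > 0:                   # renormalize over the window
            dists.append((w * d).sum() / mass)
    return float(np.mean(dists))

def future_attention_influence(A):
    """FAI[j]: average attention token j receives from all later tokens.

    High-FAI tokens are candidate anchor tokens that exert broad
    downstream influence over subsequent generation.
    """
    T = A.shape[0]
    fai = np.zeros(T)
    for j in range(T - 1):
        fai[j] = A[j + 1:, j].mean()
    return fai
```

For a uniform causal matrix (each query attends equally to all prior positions), the earliest tokens receive the highest FAI scores, matching the intuition that broadly referenced tokens score highest.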