Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation that embeds physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$ and applies the symplectic shear $\hat{q}_t = q_t + \gamma p_t$ to queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution: by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. Whereas standard architectures require two layers to perform induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. In 5,100+ controlled experiments (documented in Supplementary Appendices A--R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while staying within $\sim$2.9% of a 350M baseline's validation loss. Dedicated associative recall experiments reveal a scaling law $\gamma^* = 4.17 \times N^{-0.74}$, establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.
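
To make the shear and its filter interpretation concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names `momentum_shear` and `shear_gain`, the choice $\gamma = 0.5$, and the convention $p_0 = 0$ for the first token are all assumptions made for illustration. Expanding the shear gives $\hat{q}_t = (1 + \gamma) q_t - \gamma q_{t-1}$, a two-tap FIR filter with frequency response $H(\omega) = 1 + \gamma (1 - e^{-i\omega})$; the sketch checks that response against a constant (DC) sequence and an alternating (Nyquist-rate) sequence.

```python
import numpy as np


def momentum_shear(q, gamma):
    """Apply q_hat[t] = q[t] + gamma * (q[t] - q[t-1]) along the time axis (axis 0).

    Assumption (not from the source): the first token has no predecessor,
    so its momentum p[0] is set to zero.
    """
    p = np.zeros_like(q)
    p[1:] = q[1:] - q[:-1]  # kinematic difference p_t = q_t - q_{t-1}
    return q + gamma * p


def shear_gain(omega, gamma):
    """Magnitude of H(omega) = 1 + gamma * (1 - exp(-i * omega))."""
    return np.abs(1.0 + gamma * (1.0 - np.exp(-1j * omega)))


gamma = 0.5  # illustrative value only
T = 64
t = np.arange(T)

dc = np.ones((T, 1))              # constant sequence, omega = 0
nyquist = ((-1.0) ** t)[:, None]  # alternating sequence, omega = pi

dc_out = momentum_shear(dc, gamma)
ac_out = momentum_shear(nyquist, gamma)

print("gain at omega=0 :", shear_gain(0.0, gamma))
print("gain at omega=pi:", shear_gain(np.pi, gamma))
print("DC amplitude after shear     :", np.abs(dc_out[1:]).max())
print("Nyquist amplitude after shear:", np.abs(ac_out[1:]).max())
```

Under these assumptions the momentum term $\gamma p_t$ vanishes on the constant sequence, so the sheared signal keeps unit gain at $\omega = 0$, while the alternating sequence is amplified by $1 + 2\gamma$. The difference operator by itself has zero gain at DC, which is the high-pass behavior the duality refers to.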
