Attention is Just Another Name for Coupling?: A Fast-Slow ODE Perspective on Hierarchical Pretraining

Causal self-attention is a coupling mechanism: each token's hidden state is updated by a learned mixture of preceding tokens at the same timescale. We ask whether a second, temporally slower coupling complements it: a slow subsystem that operates on a temporally downsampled view of the sequence and feeds back into the fast path through a zero-initialised gate. We frame the question in the language of singularly perturbed ordinary differential equations (ODEs): the fast variable $x$ evolves at the token rate, the slow variable $y$ is updated once per $P$ tokens, and the timescale ratio $\varepsilon = 1/P$ is enforced structurally by causal block-mean pooling. We instantiate the fast-slow ODE formalism as a concrete neural network: a fast path of standard causal attention over $T$ tokens, a slow path of full attention over $T/P$ pooled tokens (a factor of $P^2$ cheaper per layer), and a zero-initialised additive gate. In addition, under a linear-generator assumption on the fast dynamics, we prove that the equilibrium manifold $x = \varphi(y)$ is exactly the master-equation (ME) stationary distribution $p_{\mathrm{st}}(y)$; in that regime a learned MLP $\varphi_\theta(y)$ is a variational approximation of it. The trained block is not a generator, so this identity holds in the structured limit and is not a claim about the network as trained. Empirically, at 500k tokens the coupling is neutral: the gate stays closed, and the coupled and frozen ablations are within run-to-run noise of each other, at a wall-clock cost comparable to a dense baseline. The contribution is the precise, gap-marked mapping itself, not a performance gain.
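To make the three components concrete, the following is a minimal NumPy sketch of one fast-slow block, not the implementation used in the paper. It assumes single-head attention with identity projections, a sequence length $T$ divisible by $P$, and a scalar gate. The function names and the way slow outputs are broadcast back to tokens (each pooled token repeated $P$ times) are illustrative assumptions, and the sketch does not model how the feedback is kept causal.

```python
import numpy as np

def block_mean_pool(x, P):
    """Average non-overlapping blocks of P tokens: (T, d) -> (T // P, d)."""
    T, d = x.shape
    assert T % P == 0, "sketch assumes T is divisible by P"
    return x.reshape(T // P, P, d).mean(axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, causal):
    """Single-head self-attention with identity projections (assumption)."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)           # (n, n) score matrix
    if causal:
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ x

def fast_slow_block(x, P, gate):
    """Fast causal path plus gated feedback from the pooled slow path."""
    fast = attention(x, causal=True)         # T x T scores
    y = block_mean_pool(x, P)                # slow variable, T/P tokens
    slow = attention(y, causal=False)        # (T/P) x (T/P) scores
    feedback = np.repeat(slow, P, axis=0)    # hypothetical broadcast to T tokens
    return fast + gate * feedback            # additive gate; 0.0 at initialisation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, P, d = 16, 4, 8
    x = rng.normal(size=(T, d))
    out = fast_slow_block(x, P, gate=0.0)
    # With the gate at zero the block reduces to the fast path alone.
    print(np.allclose(out, attention(x, causal=True)))
    # Score-matrix sizes differ by a factor of P**2.
    print((T * T) // ((T // P) ** 2) == P ** 2)
```

With the gate at its zero initial value the block is exactly the fast path, which is the property that lets a closed gate read as neutral coupling. The score matrices have $T^2$ and $(T/P)^2$ entries respectively, which is where the $P^2$ factor comes from.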
