Transformers enable powerful content-based global routing via self-attention, but they lack an explicit local geometric prior along the sequence axis. As a result, the placement of locality-inducing modules in hybrid architectures has largely been decided empirically. We study a simple deterministic PDE diffusion layer, implemented as a single explicit Euler step of one-dimensional heat smoothing with a discrete Neumann Laplacian under a spectral stability constraint, and ask a structural question: where should diffusion be inserted relative to attention? Our central claim is that diffusion and attention generally do not commute, so inserting the same local operator before versus after attention leads to qualitatively different behavior. We develop a three-layer operator-theoretic framework that (1) establishes unconditional guarantees for the diffusion subsystem, including spectral non-expansiveness and monotone Dirichlet-energy dissipation whenever the diffusion step size is smaller than one half, (2) derives compositional perturbation bounds linking insertion effects to representation roughness and downstream amplification, and (3) uses diffusion-attention non-commutativity as a diagnostic for structural double-mixing conflicts. Guided by the theory, we evaluate seven insertion positions on the Long Range Arena benchmark. Early diffusion acts as effective pre-regularization, improving average accuracy by 4.1 percentage points when applied directly after the embedding layer, whereas post-attention diffusion degrades performance by 2.5 percentage points, consistent with the predicted conflict. A multi-scale diffusion variant yields consistent gains under the same global stability constraint. Our analysis provides a general template for reasoning about local-global compositions in sequence models by separating provable guarantees, compositional bounds, and mechanistic diagnostics.
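As a concrete illustration of the diffusion subsystem described above, the following is a minimal NumPy sketch of one explicit Euler heat-smoothing step with a discrete Neumann Laplacian, together with the Dirichlet energy and a small non-commutativity check. The shapes, the names (`diffusion_step`, `dirichlet_energy`, `tau`), and the row-stochastic matrix standing in for attention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def diffusion_step(x: np.ndarray, tau: float = 0.25) -> np.ndarray:
    """One explicit Euler step of 1D heat smoothing along the sequence axis.

    x   : activations of shape (seq_len, d_model); each channel is smoothed
          independently along axis 0.
    tau : diffusion step size. The discrete Neumann Laplacian has spectrum
          in [-4, 0], so the update I + tau*L is non-expansive iff tau <= 1/2.
    """
    assert 0.0 < tau <= 0.5, "spectral stability requires tau <= 1/2"
    # Neumann (zero-flux) boundary via edge replication: x[-1] := x[0] and
    # x[n] := x[n-1], so nothing leaks past the sequence ends.
    p = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    laplacian = p[:-2] - 2.0 * x + p[2:]  # x[i-1] - 2*x[i] + x[i+1]
    return x + tau * laplacian

def dirichlet_energy(x: np.ndarray) -> float:
    """Sum of squared first differences along the sequence; monotonically
    non-increasing under diffusion_step when tau <= 1/2."""
    return float(np.sum(np.diff(x, axis=0) ** 2))

# Sanity checks: energy dissipation and diffusion-attention non-commutativity.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
assert dirichlet_energy(diffusion_step(x)) <= dirichlet_energy(x)

# A fixed row-stochastic matrix stands in for one attention mixing step
# (an illustrative stand-in, not a trained attention map).
A = rng.random((16, 16))
A /= A.sum(axis=1, keepdims=True)
gap = np.linalg.norm(diffusion_step(A @ x) - A @ diffusion_step(x))
print(f"commutator residual: {gap:.4f}")  # generally nonzero
```

With `tau` at most one half the step is non-expansive and the energy assertion holds for any input; the generally nonzero commutator residual is exactly the order sensitivity that the abstract attributes to double-mixing conflicts when the same local operator is placed before versus after attention.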