Transformers owe much of their empirical success in natural language processing to their self-attention blocks. Recent perspectives interpret attention blocks as interacting particle systems whose mean-field limits correspond to gradient flows of interaction energy functionals on probability density spaces equipped with Wasserstein-$2$-type metrics. We extend this viewpoint by introducing accelerated attention blocks derived from inertial Nesterov-type dynamics on density spaces. In our proposed architecture, tokens carry both spatial (feature) and velocity variables. Time discretization and approximation of the accelerated density dynamics yield Hamiltonian momentum attention blocks, which constitute the proposed accelerated attention architectures. In particular, for linear self-attention, we show that the attention blocks approximate a Stein variational gradient flow of a potential energy with a bilinear kernel. In this setting, we prove that elliptically contoured probability distributions are preserved under the accelerated attention blocks. We present implementable particle-based algorithms and demonstrate that the proposed accelerated attention blocks converge faster than classical attention blocks while requiring the same number of oracle calls.
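As a rough illustration of the contrast between classical attention blocks and momentum attention blocks in which tokens carry both feature and velocity states, the following minimal sketch updates token particles with an attention-weighted drift, once with a plain Euler step and once with a Nesterov-style velocity update. The drift form, step size, damping factor `beta`, and all function names are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch (illustrative assumptions, not the paper's construction) of a
# momentum-style attention update on token particles:
#   * the classical block moves tokens along an attention-weighted drift;
#   * the accelerated block keeps a velocity per token, updated with a
#     Nesterov-like damping factor `beta` (hypothetical parameter name).
import numpy as np

def attention_drift(X, Q, K, V):
    """Attention-weighted drift of each token (softmax over query-key scores)."""
    scores = (X @ Q) @ (X @ K).T / np.sqrt(X.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ (X @ V)

def classical_block(X, Q, K, V, step=0.1):
    """One classical attention block: explicit Euler step along the drift."""
    return X + step * attention_drift(X, Q, K, V)

def momentum_block(X, P, Q, K, V, step=0.1, beta=0.9):
    """One momentum attention block: tokens carry positions X and velocities P."""
    P = beta * P + step * attention_drift(X, Q, K, V)  # damped velocity update
    X = X + step * P                                   # position update along velocity
    return X, P

# Toy usage: n tokens of dimension d with random projection matrices.
rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.standard_normal((n, d))
Q, K, V = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
P = np.zeros_like(X)
for _ in range(5):
    X, P = momentum_block(X, P, Q, K, V)
```

Both blocks call `attention_drift` once per step, which is the sense in which a momentum variant can accelerate convergence without increasing the number of oracle calls.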