Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers.
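The one-dimensional claim can be illustrated numerically. The sketch below is not taken from the text above: it assumes one common continuous-time form of self-attention dynamics, dx_i/dt = v * sum_j P_ij x_j with P_ij = softmax_j(qk * x_i * x_j), and the scalars `qk` and `v`, the initial tokens `x0`, and the Euler step size are all hypothetical choices made for illustration. Under these assumptions, the attention matrix should approach a Boolean matrix in which every token attends to one of two "leaders" (the leftmost or the rightmost particle).

```python
import numpy as np

def attention(x, qk=1.0):
    """Row-stochastic self-attention matrix for scalar tokens x (assumed form)."""
    logits = qk * np.outer(x, x)
    # Subtract the row maximum so the exponentials stay bounded.
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def simulate(x0, v=1.0, qk=1.0, dt=0.01, steps=1500):
    """Explicit Euler integration of dx_i/dt = v * sum_j P_ij x_j (assumed dynamics)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        x = x + dt * v * (attention(x, qk) @ x)
    return x, attention(x, qk)

# Hypothetical initial tokens: three negative, three positive.
x0 = [-1.5, -0.8, -0.4, 0.5, 0.9, 1.4]
x_final, P = simulate(x0)

P_rounded = np.round(P)
print(P_rounded.astype(int))
print("rank:", np.linalg.matrix_rank(P_rounded))
print("leader attended to by each token:", P_rounded.argmax(axis=1))
```

With positive `qk` and `v`, the negative tokens concentrate their attention on the most negative token and the positive tokens on the most positive one, so the rounded matrix has only two distinct rows. Time-independent weights correspond here to keeping `qk` and `v` fixed across all Euler steps.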