Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers.
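The one-dimensional statement can be explored numerically. The following is a minimal sketch, not the authors' code. It assumes scalar query, key and value weights all equal to 1, so that each particle follows dx_i/dt = sum_j P_ij x_j with P_ij = softmax_j(x_i x_j), and it integrates with a plain forward Euler scheme. The function names, the initial tokens, the step size and the time horizon are all illustrative choices.

```python
import numpy as np


def attention_matrix(x):
    """Row-stochastic self-attention matrix P_ij = softmax_j(x_i * x_j) for 1D tokens.

    Assumption: scalar query/key weights equal to 1.
    """
    logits = np.outer(x, x)
    # Subtract the row-wise maximum so the exponentials cannot overflow.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def simulate(x0, t_final=8.0, dt=0.01):
    """Forward Euler integration of dx_i/dt = sum_j P_ij(x) x_j (value weight assumed 1)."""
    x = np.asarray(x0, dtype=float).copy()
    n_steps = int(round(t_final / dt))
    for _ in range(n_steps):
        x = x + dt * (attention_matrix(x) @ x)
    return x, attention_matrix(x)


# Illustrative initial tokens (hypothetical values, three negative and three positive).
x0 = [-1.5, -0.8, -0.4, 0.5, 1.0, 1.6]
x_final, P_final = simulate(x0)

# Print the attention matrix at the final time, rounded for readability.
print(np.round(P_final, 3))
```

Under these assumptions, each row of the printed matrix should put essentially all of its mass on a single column, that of the largest or the smallest token. This matches the "leaders" picture described above. The sketch only illustrates the long-time behaviour for one initial condition and does not substitute for the proof.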