Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers.
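The one-dimensional claim can be illustrated numerically. The sketch below is not taken from the abstract, which does not state the dynamics. It assumes the commonly used continuous-time self-attention model dx_i/dt = sum_j P_ij(t) V x_j(t), where P is the row-wise softmax of the products (Q x_i)(K x_j). It further assumes scalar weights Q = K = V = 1, a forward-Euler discretization, and an arbitrary set of five initial tokens. The function names, step size, and horizon are all hypothetical choices made for illustration.

```python
import math

def attention_matrix(x, q=1.0, k=1.0):
    """Row-stochastic softmax attention for scalar tokens x (assumed 1D model)."""
    rows = []
    for xi in x:
        logits = [(q * xi) * (k * xj) for xj in x]
        m = max(logits)  # subtract the row maximum for numerical stability
        w = [math.exp(l - m) for l in logits]
        s = sum(w)
        rows.append([wj / s for wj in w])
    return rows

def simulate(x0, v=1.0, dt=0.01, steps=500):
    """Forward-Euler integration of dx_i/dt = sum_j P_ij(t) * v * x_j(t).

    The weights q, k, v are held fixed over time, matching the
    time-independent setting; their values here are assumptions.
    """
    x = list(x0)
    for _ in range(steps):
        P = attention_matrix(x)
        x = [xi + dt * sum(pij * v * xj for pij, xj in zip(row, x))
             for xi, row in zip(x, P)]
    return x, attention_matrix(x)

# Hypothetical initial tokens: two negative, three positive.
x0 = [-1.5, -0.5, 0.4, 1.0, 2.0]
x_final, P_final = simulate(x0)

# For each row, the index of the token receiving the largest attention weight.
leaders = sorted({max(range(len(row)), key=row.__getitem__) for row in P_final})

for row in P_final:
    print([round(p, 3) for p in row])
print("leaders:", leaders)
```

Under these assumptions, each row of the final attention matrix should concentrate almost entirely on one of the two extreme tokens, so the matrix is numerically close to a Boolean matrix of rank two. Those two tokens play the role of leaders. The sketch is a sanity check on a single toy configuration, not a substitute for the proof.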