A key component of transformers is the attention mechanism, which orchestrates how each token influences the propagation of every other token through the network. In this paper we provide a rigorous mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion: all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
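For orientation, the statement can be read against the standard single-head self-attention update shown below; this is a generic sketch of the type of dynamics usually analyzed in this setting, not necessarily the exact model or assumptions of the paper. With $x_1(t),\dots,x_n(t)$ denoting the token representations at layer $t$ and $Q$, $K$, $V$ the query, key, and value matrices,
\[
  x_i(t+1) \;=\; \sum_{j=1}^{n}
    \frac{\exp\!\big(\langle Q x_i(t),\, K x_j(t)\rangle\big)}
         {\sum_{k=1}^{n}\exp\!\big(\langle Q x_i(t),\, K x_k(t)\rangle\big)}
    \, V x_j(t),
  \qquad i = 1,\dots,n,
\]
and the convergence phenomenon referred to above corresponds to the pairwise distances between tokens vanishing with depth,
\[
  \max_{i,j}\,\big\| x_i(t) - x_j(t) \big\| \;\longrightarrow\; 0
  \quad \text{as } t \to \infty .
\]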