Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with $H$ heads, dimension $d$, and context size $n < d$, featuring $\Theta(Hd^2)$ parameters, can memorize $\Omega(Hn)$ examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property. We validate our findings through experiments on synthetic data.
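The saturation property of the softmax operator mentioned above can be illustrated with a small numerical sketch. This is not the paper's construction. It assumes only the standard softmax definition, and the attention scores, the `scale` factor, and the helper name `saturated_weights` are hypothetical choices made for illustration. As the scores are scaled up, the attention weights approach a one-hot vector concentrated on the largest score, which is the behavior that could let a head attend almost exclusively to a single token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def saturated_weights(logits, scale):
    """Softmax of the logits after multiplying them by a positive scale (hypothetical helper)."""
    return softmax([scale * x for x in logits])


# Hypothetical attention scores for a context of n = 4 tokens (illustrative values only).
logits = [0.5, 1.0, 0.2, -0.3]

for scale in (1, 10, 100):
    weights = saturated_weights(logits, scale)
    # As scale grows, the weight on the token with the largest score tends to 1.
    print(scale, [round(w, 4) for w in weights])
```

At a scale of 1 the weights are spread across all four tokens, while at larger scales nearly all of the mass sits on the second token, which has the largest score.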