Existing analyses of the expressive capacity of Transformer models have required excessively deep architectures for data memorization, leading to a discrepancy with the Transformers used in practice. This is primarily because these analyses interpret the softmax function as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices can perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer, single-head Transformers have memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation-equivariant functions on a compact domain.
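To make the softmax and Boltzmann operator connection concrete, the following is a minimal sketch rather than the construction used in the proofs. It assumes the usual definition of the Boltzmann operator as the softmax-weighted average of its own inputs, and it uses scalar values and a one-dimensional query and key purely for brevity. The function names `softmax`, `boltzmann`, and `attention_row` are illustrative choices, not names taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(scores):
    """Boltzmann operator (assumed definition): softmax-weighted average of the scores."""
    weights = softmax(scores)
    return sum(w * s for w, s in zip(weights, scores))


def attention_row(query, keys, values):
    """One output of single-head dot-product attention with scalar values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


if __name__ == "__main__":
    scores = [0.5, -1.0, 2.0]

    # With query = [1.0] and keys = [[s]], the attention scores equal `scores`;
    # choosing the values equal to the scores turns the attention output
    # into the Boltzmann operator applied to the scores.
    keys = [[s] for s in scores]
    print("Boltzmann operator:", boltzmann(scores))
    print("Attention output:  ", attention_row([1.0], keys, scores))

    # The result depends on the whole sequence but not on its order.
    print("Reversed input:    ", boltzmann(list(reversed(scores))))
```

In this toy setting the attention output is a single number that aggregates every position of the input. It always lies between the mean and the maximum of the scores, unlike a hardmax, which would return the maximum alone.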