Existing analyses of the expressive capacity of Transformer models have required excessively many layers for data memorization, which is at odds with the Transformers actually used in practice. This is primarily because the softmax function is interpreted as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices can perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer, single-head Transformers have memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
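As a rough illustration of the softmax and Boltzmann operator connection mentioned above, the sketch below assumes the common definition of the Boltzmann operator as the softmax-weighted average of the scores themselves. It is not the construction used in the proofs; the function names `softmax` and `boltzmann` and the example scores are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(scores):
    """Boltzmann operator (assumed definition): sum_i scores[i] * softmax(scores)[i]."""
    weights = softmax(scores)
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical attention scores for a three-token sequence.
scores = [1.0, 2.0, 3.0]
weights = softmax(scores)
boltz = boltzmann(scores)

# Hardmax keeps only the largest score, whereas the Boltzmann value
# still depends on every entry of the score vector.
hard = max(scores)
mean = sum(scores) / len(scores)

# Scaling the scores sharpens the softmax, so the rescaled Boltzmann
# value moves toward the hardmax value.
sharp = boltzmann([10.0 * s for s in scores]) / 10.0
```

Here `boltz` lies strictly between the plain average `mean` and the hardmax value `hard`, and `sharp` is much closer to `hard`. This is one way to see that softmax attention carries information about all scores rather than only the largest one.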