We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention can approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these results, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers are capable of approximating various statistical models in-context. We believe these techniques hold independent interest.
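As a minimal illustration of the key insight (our own sketch under simplifying assumptions, not the paper's interpolation-based construction): consider a single softmax attention head with scalar queries, keys, and values, attending over the two entries $x$ and $0$ with scores $\beta x$ and $0$, where the scaling $\beta$ is set by the linear query-key transformations. Its output is a softmax-weighted average that converges to ReLU as $\beta$ grows:
\[
\operatorname{Attn}_\beta(x) \;=\; \frac{x\,e^{\beta x} + 0\cdot e^{0}}{e^{\beta x} + e^{0}} \;=\; \frac{x}{1 + e^{-\beta x}} \;\xrightarrow{\;\beta\to\infty\;}\; \max(x,0) \;=\; \operatorname{ReLU}(x).
\]
This toy one-token calculation only hints at the mechanism; the results stated above concern multi-token sequences and uniform approximation on compact domains.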