Attention layers, which map a sequence of inputs to a sequence of outputs, are core building blocks of the Transformer architecture behind many recent breakthroughs in artificial intelligence. This paper presents a rigorous theoretical study of learning and generalization for a single multi-head attention layer whose input is a sequence of key vectors and a separate query vector. We consider the random feature setting, in which the attention layer has a large number of heads with randomly sampled, frozen query and key matrices and trainable value matrices. We show that such a random-feature attention layer can express a broad class of target functions that are invariant to permutations of the key vectors. We further provide quantitative excess risk bounds for learning these target functions from finite samples, using random feature attention with finitely many heads. Our results have several implications that are unique to the attention structure when compared with existing random features theory for neural networks: (1) advantages in sample complexity over standard two-layer random-feature networks; (2) concrete and natural classes of functions that a random-feature attention layer can learn efficiently; and (3) the effect of the sampling distribution of the query-key weight matrix (the product of the query and key matrices), where Gaussian random weights with a non-zero mean yield better sample complexity than their zero-mean counterpart for learning certain natural target functions. Experiments on simulated data corroborate our theoretical findings and further illustrate the interplay between the sample size and the complexity of the target function.
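To make the random feature setting concrete, the following is a minimal sketch of one way such a layer could be written. It is not the paper's exact parameterization, and several details are assumptions: the function names (`sample_heads`, `rf_attention_features`, `rf_attention_output`) are hypothetical, a ReLU score stands in for the attention nonlinearity, the output is a scalar, and the `mean` argument shifts only the diagonal of each query-key matrix as one illustrative non-zero-mean choice.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def sample_heads(num_heads, dim, mean=0.0, seed=0):
    """Draw frozen query-key matrices W_m, one per head.

    Entries are Gaussian with variance 1/dim. `mean` is added on the
    diagonal only (an illustrative assumption, not taken from the source).
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) / dim ** 0.5 + (mean if r == c else 0.0)
          for c in range(dim)] for r in range(dim)]
        for _ in range(num_heads)
    ]


def rf_attention_features(query, keys, heads):
    """Per-head feature vector: average over keys of relu(q^T W_m k_i) * k_i."""
    dim, n = len(query), len(keys)
    feats = []
    for W in heads:
        # q^T W_m is shared by every key for this head.
        qW = [sum(query[r] * W[r][c] for r in range(dim)) for c in range(dim)]
        acc = [0.0] * dim
        for k in keys:
            score = relu(sum(qW[c] * k[c] for c in range(dim)))
            for c in range(dim):
                acc[c] += score * k[c] / n
        feats.append(acc)
    return feats


def rf_attention_output(query, keys, heads, values):
    """Scalar output: sum over heads of <v_m, feature_m>; only `values` is trainable."""
    feats = rf_attention_features(query, keys, heads)
    return sum(
        sum(v[c] * f[c] for c in range(len(v)))
        for v, f in zip(values, feats)
    )
```

Because each head averages over the keys, the output of this sketch does not depend on the order of the keys. It is also linear in the value vectors, so with the heads frozen, fitting `values` reduces to a linear regression on the features.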