Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics in the finite-width thermodynamic limit, i.e., $N,P\rightarrow\infty$ with $P/N=\mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each pairing different 'attention paths', defined as information pathways through different attention heads across layers. The kernels are weighted according to a 'task-relevant kernel combination' mechanism that aligns the total kernel with the task labels. As a consequence, this interplay between attention paths enhances generalization performance. Experiments confirm our findings on both synthetic and real-world sequence classification tasks. Finally, our theory explicitly relates the kernel combination mechanism to properties of the learned weights, allowing its insights to transfer qualitatively to models trained via gradient descent. As an illustration, we demonstrate an efficient size reduction of the network by pruning those attention heads that our theory deems less relevant.
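To make the claimed decomposition concrete, here is a schematic rendering of its structure; the path indices $\pi,\pi'$ and the combination weights $U_{\pi\pi'}$ are notation introduced only for this sketch, not taken from the paper's formal results. An attention path $\pi$ selects one head per layer, $K_{\pi\pi'}$ is the kernel pairing paths $\pi$ and $\pi'$, and the predictor statistics are those induced by the effective kernel

$$K_{\mathrm{eff}}(\mathbf{x},\mathbf{x}') \;=\; \sum_{\pi,\pi'} U_{\pi\pi'}\, K_{\pi\pi'}(\mathbf{x},\mathbf{x}'),$$

where 'task-relevant kernel combination' refers to the weights $U_{\pi\pi'}$ being set so that $K_{\mathrm{eff}}$ aligns with the structure of the task labels.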
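Below is a minimal sketch of how such combination weights could inform head pruning, assuming per-path-pair weights $U_{\pi\pi'}$ have already been estimated. The relevance score used here (total absolute combination weight over path pairs touching a head) and all function names are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np
from itertools import product

# Illustrative sketch (not the paper's exact criterion): score each attention
# head by how much weight the task-relevant kernel combination places on
# attention paths passing through it, then prune the lowest-scoring heads.

def head_relevance(U, n_heads, n_layers):
    """U[p, q]: combination weight pairing attention paths p and q.
    An attention path is a choice of one head per layer."""
    paths = list(product(range(n_heads), repeat=n_layers))
    scores = np.zeros((n_layers, n_heads))
    for p, path_p in enumerate(paths):
        for q, path_q in enumerate(paths):
            w = abs(U[p, q])
            for layer in range(n_layers):
                scores[layer, path_p[layer]] += w
                scores[layer, path_q[layer]] += w
    return scores  # scores[l, h]: relevance of head h in layer l

def heads_to_prune(U, n_heads, n_layers, keep_per_layer):
    """Return, per layer, the indices of the least relevant heads to remove."""
    scores = head_relevance(U, n_heads, n_layers)
    pruned = []
    for layer in range(n_layers):
        order = np.argsort(scores[layer])  # ascending relevance
        pruned.append(order[: n_heads - keep_per_layer].tolist())
    return pruned

# Example: 2 layers with 4 heads each -> 4**2 = 16 attention paths.
rng = np.random.default_rng(0)
U = rng.normal(size=(16, 16))
print(heads_to_prune(U, n_heads=4, n_layers=2, keep_per_layer=2))
```

Note the design choice in this sketch: relevance is accumulated per layer, so a head kept in one layer may be pruned in another, consistent with paths being defined layer by layer.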