Transformers are neural networks that have revolutionised natural language processing and machine learning. They process sequences of inputs, such as words, using a mechanism called self-attention, which is trained via masked language modelling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalised Potts model with interactions between sites and Potts colours. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. Using this mapping, we compute the generalisation error of self-attention in a model scenario analytically with the replica method.
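To make the link between masked-word prediction and pseudo-likelihood concrete, the following is a minimal sketch, not the construction used in the paper. It assumes an energy of the form E(s) = -sum_{i<j} J[i, j] * U[s_i, s_j], with a symmetric site coupling `J` and a symmetric colour coupling `U`; this factorised form is only one possible reading of "interactions between sites and Potts colours". The function names `conditional` and `neg_pseudo_log_likelihood` are hypothetical, chosen for illustration.

```python
import numpy as np

def conditional(seq, i, J, U):
    """Return p(s_i = a | all other sites) for every colour a.

    Assumed energy (an illustrative choice, not taken from the source):
        E(s) = -sum_{i<j} J[i, j] * U[s_i, s_j]
    with J and U symmetric.
    """
    field = np.zeros(U.shape[0])
    for j in range(len(seq)):
        if j != i:
            # contribution of site j to the local field on each colour at site i
            field += J[i, j] * U[:, seq[j]]
    field -= field.max()  # subtract the maximum for numerical stability
    p = np.exp(field)
    return p / p.sum()

def neg_pseudo_log_likelihood(seqs, J, U):
    """Average of -log p(s_i | rest) over all sequences and all sites.

    Each term is the cross-entropy of predicting one masked site from
    the unmasked ones, which is the shape of the MLM objective.
    """
    seqs = np.asarray(seqs)
    total = 0.0
    for seq in seqs:
        for i in range(len(seq)):
            total -= np.log(conditional(seq, i, J, U)[seq[i]])
    return total / seqs.size
```

Minimising `neg_pseudo_log_likelihood` over `J` and `U`, for instance by gradient descent (omitted here), would be the pseudo-likelihood approach to the inverse Potts problem under the assumed energy. With all couplings set to zero, every conditional is uniform and the loss equals the logarithm of the number of colours.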