Transformers are neural networks that have revolutionized natural language processing and machine learning. They process sequences of inputs, like words, using a mechanism called self-attention, which is trained via masked language modeling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalized Potts model with interactions between sites and Potts colors. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. With this mapping, we compute the generalization error of self-attention analytically in a model scenario using the replica method.
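The correspondence between masked-token prediction and the pseudo-likelihood method can be illustrated with a minimal NumPy sketch. It is not the construction used in this work: the factorised coupling J[i, j, a, b] = A[i, j] * U[a, b], the names `A` (a position-position matrix) and `U` (a color-color matrix), and the function names are all assumptions introduced here for illustration. The sketch computes the conditional distribution of one masked site given the others in a Potts model, and sums the resulting cross-entropies over all sites, which gives the negative log-pseudo-likelihood.

```python
import numpy as np

def conditional_logits(seq, A, U, i):
    """Logits over the q colors at masked site i, given all other sites.

    Assumed (hypothetical) factorised coupling: J[i, j, a, b] = A[i, j] * U[a, b].
    """
    L = len(seq)
    others = np.arange(L) != i            # exclude the masked site itself
    # sum over j != i of A[i, j] * U[:, seq[j]]  -> vector of length q
    return U[:, seq[others]] @ A[i, others]

def log_softmax(x):
    """Numerically stable log-softmax of a 1D array."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def neg_log_pseudo_likelihood(seqs, A, U):
    """Average over sequences of -sum_i log p(s_i | s_{-i})."""
    total = 0.0
    for seq in seqs:
        for i in range(len(seq)):         # mask each site in turn
            total -= log_softmax(conditional_logits(seq, A, U, i))[seq[i]]
    return total / len(seqs)

# Toy data: 4 sequences of length L = 5 with q = 3 Potts colors.
rng = np.random.default_rng(0)
L, q = 5, 3
seqs = rng.integers(0, q, size=(4, L))
A = rng.normal(size=(L, L))
U = rng.normal(size=(q, q))

loss = neg_log_pseudo_likelihood(seqs, A, U)
# With all positional couplings set to zero, every conditional is uniform.
loss_zero = neg_log_pseudo_likelihood(seqs, np.zeros((L, L)), U)
```

Under these assumptions, each term of the loss is the cross-entropy of predicting one masked symbol from the rest of the sequence, so minimising it over `A` and `U` has the same form as an MLM objective with one masked position at a time. When `A` is zero the conditionals are uniform and the loss reduces to L log q.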