Transformers are neural networks that have revolutionised natural language processing and machine learning. They process sequences of inputs, such as words, using a mechanism called self-attention, which is trained via masked language modelling (MLM). In MLM, a randomly chosen word in an input sequence is masked, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalised Potts model with interactions between sites and Potts colours. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. Using this mapping, we analytically compute the generalisation error of self-attention in a model scenario using the replica method.
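To make the pseudo-likelihood objective concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes one common parametrisation of a generalised Potts model, in which the conditional of site i given all other sites is a softmax over colours a of the sum over j ≠ i of J[i][j] * U[a][s_j]. The names `J` (site-site couplings), `U` (colour-colour couplings), `conditional`, and `neg_pseudo_log_likelihood` are hypothetical and chosen for illustration only.

```python
import math


def conditional(seq, i, J, U, q):
    """Return [p(s_i = a | all other sites) for a in 0..q-1].

    Assumed form (illustrative, not taken from the source):
    logit_a = sum over j != i of J[i][j] * U[a][seq[j]].
    """
    logits = [
        sum(J[i][j] * U[a][seq[j]] for j in range(len(seq)) if j != i)
        for a in range(q)
    ]
    # Numerically stable softmax over the q colours.
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]


def neg_pseudo_log_likelihood(seq, J, U, q):
    """Sum of -log p(s_i | rest) over all sites of one sequence.

    Each term is the cross-entropy of predicting the colour at site i
    when that site is masked, which mirrors the MLM objective.
    """
    return -sum(
        math.log(conditional(seq, i, J, U, q)[seq[i]])
        for i in range(len(seq))
    )


# Toy example: 3 sites, 2 colours, uniform couplings between sites,
# and a colour coupling that favours equal colours.
J = [[1.0] * 3 for _ in range(3)]
U = [[1.0, 0.0], [0.0, 1.0]]
p = conditional([0, 0, 0], 1, J, U, 2)
loss = neg_pseudo_log_likelihood([0, 0, 0], J, U, 2)
```

Minimising this loss over `J` and `U` on a set of sampled sequences is the pseudo-likelihood approach to the inverse Potts problem. Under the decoupling of positions and embeddings described in the abstract, `J` would play the role of a position-only attention matrix and `U` that of a matrix acting on the embeddings; that identification is stated here as an assumption of the sketch.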