The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
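To make the central claim concrete, the following is a minimal numerical sketch of how a single attention head with a scalar (dimension-1) query-key product can reproduce SiLU. It is a simplified illustration under stated assumptions, not necessarily the exact construction used in the proofs: we assume the mask lets each position attend only to itself and to one constant "anchor" position, that the anchor receives attention score 0 and value 0, and that the current position receives score x and value x. The function names are hypothetical.

```python
import numpy as np

def silu(x):
    """Reference SiLU: x * sigmoid(x)."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-x))

def silu_via_attention(x):
    """Emulate SiLU with a two-position softmax attention pattern.

    Assumed setup (illustrative only): each position attends to a
    constant anchor position and to itself. The query-key product is a
    single scalar per pair, i.e. internal dimension 1.
    """
    x = np.asarray(x, dtype=float)
    # Pre-softmax scores: 0 for the anchor, x for the current position.
    scores = np.stack([np.zeros_like(x), x], axis=-1)
    # Numerically stable softmax over the two attended positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Values: 0 for the anchor, x for the current position.
    values = np.stack([np.zeros_like(x), x], axis=-1)
    # Weighted sum: sigmoid(x) * x + (1 - sigmoid(x)) * 0.
    return (weights * values).sum(axis=-1)

xs = np.linspace(-5.0, 5.0, 11)
max_err = np.max(np.abs(silu_via_attention(xs) - silu(xs)))
```

The softmax over the scores (0, x) puts weight sigmoid(x) on the current position, so the head outputs x · sigmoid(x), which is SiLU(x). In this sketch, `max_err` should be at the level of floating-point rounding.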