Current large language models are mainly based on decoder-only transformers, which exhibit strong in-context learning (ICL) capabilities. It is generally believed that an important foundation of this ICL capability is the induction heads mechanism, which requires at least two layers of attention. To realize the model's induction ability more efficiently, we revisit the induction heads mechanism and propose KV shifting attention. We theoretically prove that KV shifting attention reduces the depth and width the model requires for the induction heads mechanism. Our experimental results demonstrate that KV shifting attention is beneficial to learning induction heads and to language modeling, leading to better performance or faster convergence, from toy models up to pre-trained models with more than 10B parameters.
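The abstract does not spell out the mechanism, but one plausible reading of "KV shifting" is that each token's key and value are replaced by a learnable mix of the current and the previous token's key/value, so a single attention layer can see "previous token" information that induction heads normally need a second layer to route. The sketch below illustrates this reading with NumPy; the function names, the pair of mixing weights `alpha`/`beta`, and zero-padding at the first position are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kv_shift(K, V, alpha=(0.5, 0.5), beta=(0.5, 0.5)):
    """Assumed KV shifting: mix each position's key/value with the
    previous position's (zero-padded at t=0).

    K, V: arrays of shape (T, d). alpha/beta: (current, previous) weights.
    """
    K_prev = np.concatenate([np.zeros_like(K[:1]), K[:-1]], axis=0)
    V_prev = np.concatenate([np.zeros_like(V[:1]), V[:-1]], axis=0)
    K_shift = alpha[0] * K + alpha[1] * K_prev
    V_shift = beta[0] * V + beta[1] * V_prev
    return K_shift, V_shift

def causal_attention(Q, K, V):
    """Standard single-head causal attention over (T, d) inputs."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Mask out future positions (strict upper triangle).
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Usage: shift K/V before the usual attention computation.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 6, 8))
out = causal_attention(Q, *kv_shift(K, V))
```

With `alpha=(1, 0)` and `beta=(1, 0)` the shift is the identity and the layer reduces to standard attention, so the mechanism strictly generalizes the baseline; in the paper these weights would presumably be learned per head.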