Self-attention and position embedding are two key modules in transformer-based Large Language Models (LLMs). However, the relationship between them is far from well studied, especially for long context window extension. Our work reveals anomalous behaviors between Rotary Position Embedding (RoPE) and vanilla self-attention that harm long context extrapolation. To address this issue, we propose a novel attention mechanism, CoCA (Collinear Constrained Attention). Specifically, we enforce a collinear constraint between $Q$ and $K$ to seamlessly integrate RoPE and self-attention. While adding only minimal computational and spatial complexity, this integration significantly enhances long context window extrapolation ability. We provide an optimized implementation, making CoCA a drop-in replacement for attention in existing transformer-based models. Extensive experiments show that CoCA performs exceptionally well in extending context windows. A CoCA-based GPT model, trained with a context length of 512, can extend the context window up to 32K (60$\times$) without any fine-tuning. Additionally, by dropping CoCA into LLaMA-7B, we achieve extrapolation up to 32K with a training length of only 2K. Our code is publicly available at: https://github.com/codefuse-ai/Collinear-Constrained-Attention
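To make the role of the collinear constraint concrete, the following is a minimal toy sketch, not the released CoCA implementation. It assumes a single 2D rotary pair, and the frequency `theta`, the vectors `q` and `k_free`, and the scale `t` are hypothetical values chosen only for illustration. It shows that when $K$ is a non-negative multiple of $Q$, the RoPE attention score starts at its maximum at relative distance 0 and decays with distance, whereas an unconstrained key with a nonzero initial angle to the query can produce a score that grows with distance.

```python
import math

def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians (one rotary pair)."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta):
    """Dot product of q at position m and k at position n after RoPE rotation."""
    qm = rotate(q, m * theta)
    kn = rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

theta = 0.1                          # hypothetical rotary frequency
q = (1.0, 2.0)                       # hypothetical 2D query slice
t = 0.5                              # hypothetical non-negative scale
k_collinear = (t * q[0], t * q[1])   # collinear with q: initial angle is zero
k_free = (-2.0, 1.0)                 # unconstrained key, 90 degrees from q

# Scores for a query at position d attending to a key at position 0.
collinear_scores = [rope_score(q, k_collinear, d, 0, theta) for d in range(11)]
free_scores = [rope_score(q, k_free, d, 0, theta) for d in range(11)]

for d in range(11):
    print(f"d={d:2d}  collinear={collinear_scores[d]: .4f}  free={free_scores[d]: .4f}")
```

In this toy setting the collinear score equals $t\,\lVert q\rVert^2 \cos(d\theta)$ for relative distance $d$, so it depends only on $d$ and is largest at $d = 0$. How the constraint is actually parameterized and made efficient is specified by the repository linked above, not by this sketch.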