Self-attention and position embedding are two key modules in Transformer-based LLMs. The relationship between them is far from well studied, especially for context window extension. In this paper, we introduce a collinear constraint that fuses RoPE with self-attention, and name the result Collinear Constrained Attention (CoCA). We analyze the computational and space complexity of CoCA and find that it adds only minimal overhead compared to the original Transformer-based models. We provide an efficient implementation of CoCA that serves as a drop-in replacement for existing position embedding and attention modules in Transformer-based models. Experiments show that CoCA performs extraordinarily well on context window extension. For instance, a CoCA-based GPT model trained with a context length of 512 can extend its context window up to 8K without perplexity diverging. This indicates more than 16x context window extension without any fine-tuning. Our code is released here: https://github.com/codefuse-ai/Collinear-Constrained-Attention