Long-context large language models (LLMs) have achieved remarkable advancements, driven by techniques like Rotary Position Embedding (RoPE) (Su et al., 2023) and its extensions (Chen et al., 2023; Liu et al., 2024c; Peng et al., 2023). By adjusting RoPE parameters and incorporating training data with extended contexts, we can train performant models with considerably longer input sequences. However, existing RoPE-based methods exhibit performance limitations when applied to extended context lengths. This paper presents a comprehensive analysis of various attention mechanisms, including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), identifying their strengths and shortcomings in long-context modeling. Our investigation identifies distinctive attention patterns in these methods and highlights their impact on long-context performance, providing valuable insights for architectural design. Building on these findings, we propose a novel architecture based on a hybrid attention mechanism that not only surpasses conventional RoPE-based transformer models in long-context tasks but also achieves competitive performance on benchmarks requiring shorter context lengths.
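As background for the RoPE mechanism discussed above, the following is a minimal NumPy sketch (not the paper's implementation) of rotary position embedding: each pair of channels is rotated by an angle proportional to the token position, so that query-key dot products depend only on relative offsets. The `base` parameter (commonly 10000) is the rotation-frequency base that context-extension methods adjust.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply Rotary Position Embedding to x of shape (seq_len, dim).

    Channel pair (2i, 2i+1) at position m is rotated by the angle
    m * base**(-2i/dim). Because rotations compose, the dot product
    of a rotated query and key depends only on their relative offset.
    """
    seq_len, dim = x.shape
    # Per-pair rotation frequencies (the "theta" schedule RoPE extensions tune).
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = np.outer(np.arange(seq_len), inv_freq)    # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

A quick check of the relative-position property: placing the same query and key vectors at positions (2, 0) and (5, 3) yields identical dot products, since both pairs are separated by the same offset.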