Effectively processing long contexts is a critical challenge for language models. Standard Transformers are limited by quadratic complexity and poor length extrapolation, while alternative architectures such as sliding window attention and state space models cannot make effective use of the full context because their memory is fixed in size. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We also provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state of the art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically grounded set of design principles for developing future, highly capable long-context language models.
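To make the chunk-selection idea concrete, the following is a minimal sketch of chunk-based sparse attention for a single query, not the method evaluated in this work. It rests on several assumptions: each chunk is summarized by mean pooling rather than by the learned non-linear Chunk Encoder with a CLS token, token vectors serve as both keys and values, and the Bypassing Residual Path is left out entirely. All function names (`split_into_chunks`, `landmark`, `select_top_k_chunks`, `sparse_attention`) are hypothetical and chosen only for illustration.

```python
import math


def split_into_chunks(tokens, chunk_size):
    """Split a list of token vectors into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def landmark(chunk):
    """Summarize a chunk by the mean of its token vectors.

    ASSUMPTION: mean pooling is a stand-in; the abstract describes a learned,
    non-linear encoder with a dedicated CLS token instead.
    """
    dim = len(chunk[0])
    return [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]


def select_top_k_chunks(query, chunks, k):
    """Return the indices of the k chunks whose landmarks best match the query."""
    scores = [dot(query, landmark(chunk)) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore positional order of the selected chunks


def sparse_attention(query, tokens, chunk_size, k):
    """Attend only over tokens that belong to the top-k selected chunks."""
    chunks = split_into_chunks(tokens, chunk_size)
    selected = select_top_k_chunks(query, chunks, k)
    candidates = [vec for i in selected for vec in chunks[i]]

    # Scaled dot-product attention with a numerically stable softmax.
    scale = math.sqrt(len(query))
    logits = [dot(query, vec) / scale for vec in candidates]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)

    dim = len(query)
    output = [
        sum(w * vec[d] for w, vec in zip(weights, candidates)) / total
        for d in range(dim)
    ]
    return output, selected
```

Because `k` stays fixed while the number of chunks grows, the per-query attention cost in this sketch depends on `k * chunk_size` rather than on the full context length, which is the property that makes selection sparsity attractive for very long inputs.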