Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

Effectively processing long contexts is a critical challenge for language models. Standard Transformers are limited by quadratic complexity and poor length extrapolation, while alternative architectures such as sliding window attention and state space models cannot effectively utilize the full context because their memory is fixed in size. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We also provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically grounded set of design principles for developing future, highly capable long-context language models.
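
For readers unfamiliar with the paradigm, the following is a minimal sketch of chunk-based sparse attention for a single query, not the model studied in this work. It makes several simplifying assumptions: the learned non-linear Chunk Encoder with a CLS token is replaced by mean pooling over each chunk's keys, there is a single attention head, and causal masking and the Bypassing Residual Path are omitted. All function and parameter names (`chunk_landmarks`, `topk_chunk_attention`, `chunk_size`, `top_k`) are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def chunk_landmarks(keys, chunk_size):
    """One landmark vector per chunk.

    Assumption: mean pooling stands in for a learned chunk encoder.
    Trailing tokens that do not fill a whole chunk are ignored.
    """
    n, d = keys.shape
    num_chunks = n // chunk_size
    chunks = keys[: num_chunks * chunk_size].reshape(num_chunks, chunk_size, d)
    return chunks.mean(axis=1)


def topk_chunk_attention(query, keys, values, chunk_size, top_k):
    """Attend only to the tokens of the top_k chunks most relevant to the query."""
    landmarks = chunk_landmarks(keys, chunk_size)
    scores = landmarks @ query                      # one relevance score per chunk
    k = min(top_k, scores.shape[0])
    selected = np.sort(np.argsort(scores)[-k:])     # indices of the k best chunks
    # Expand chunk indices into the token indices they cover.
    token_idx = (selected[:, None] * chunk_size + np.arange(chunk_size)).ravel()
    weights = softmax(keys[token_idx] @ query / np.sqrt(query.shape[0]))
    return weights @ values[token_idx], selected


# Toy usage: 16 tokens in 4 chunks; only chunk 2 matches the query.
keys = np.zeros((16, 4))
keys[8:12, 0] = 1.0
values = np.tile(np.arange(16, dtype=float)[:, None], (1, 2))
query = np.array([1.0, 0.0, 0.0, 0.0])
out, selected = topk_chunk_attention(query, keys, values, chunk_size=4, top_k=1)
```

In this sketch the cost of attending scales with `top_k * chunk_size` rather than with the full sequence length. The selection step is a hard top-k over landmark scores. This is only one way to realize selection sparsity.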
