Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches. We open-source our Pallas kernels along with the model code to facilitate further research.
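To make the out-of-window mechanism concrete, the sketch below (in JAX, consistent with the Pallas/JAX setting of the released kernels) pairs exact sliding-window attention with a running linear-attention summary of tokens that have left the window. The function name `rattention`, the feature map `phi`, and the simple additive combination of the two branches are illustrative assumptions for a single head, not the paper's actual formulation or kernel.

```python
# Minimal single-head sketch (assumed interface, not the paper's implementation):
# queries attend exactly to tokens inside a sliding window, while tokens that
# have fallen out of the window are folded into a linear-attention state
# (a running sum of phi(k) v^T outer products) that each query can still read.
import jax
import jax.numpy as jnp

def phi(x):
    # A common positive feature map used by linear-attention variants (assumed here).
    return jax.nn.elu(x) + 1.0

def rattention(q, k, v, window):
    """q, k, v: [T, d] for one head. Returns [T, d] outputs."""
    T, d = q.shape
    outputs = []
    S = jnp.zeros((d, d))   # running sum of phi(k_j) v_j^T over evicted tokens
    z = jnp.zeros((d,))     # running sum of phi(k_j) for normalization
    for t in range(T):
        start = max(0, t - window + 1)
        # The token that just left the window (if any) is absorbed into the state.
        if t - window >= 0:
            j = t - window
            S = S + jnp.outer(phi(k[j]), v[j])
            z = z + phi(k[j])
        # Exact softmax attention inside the window.
        scores = q[t] @ k[start:t + 1].T / jnp.sqrt(d)
        local = jax.nn.softmax(scores) @ v[start:t + 1]
        # Linear-attention read of the out-of-window summary.
        qf = phi(q[t])
        global_part = (qf @ S) / (qf @ z + 1e-6)
        # Additive merge of the two branches (an illustrative choice).
        outputs.append(local + global_part)
    return jnp.stack(outputs)
```

As a quick check, `rattention(jax.random.normal(key, (16, 8)), kk, vv, window=4)` with similarly shaped `kk` and `vv` returns a `(16, 8)` array; the released kernels presumably fuse and parallelize this recurrence rather than looping over time as above.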