Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
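To make the attention patterns above concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a full causal mask with a causal sliding-window mask and builds a hypothetical layer schedule in which only the full-attention layers use NoPE. The function names, the interleaving ratio (`full_every`), and the use of RoPE in the SWA layers are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List

import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Full causal attention: query i may attend to every key j <= i."""
    idx = np.arange(seq_len)
    return idx[None, :] <= idx[:, None]


def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal SWA: query i may attend only to keys j with i - window < j <= i."""
    idx = np.arange(seq_len)
    dist = idx[:, None] - idx[None, :]  # query index minus key index
    return (dist >= 0) & (dist < window)


def hybrid_layer_plan(num_layers: int, full_every: int) -> List[Dict[str, object]]:
    """Hypothetical schedule: every `full_every`-th layer is full attention
    with no positional encoding (NoPE); all other layers are SWA layers that
    keep a positional encoding (RoPE is assumed here)."""
    plan = []
    for layer in range(num_layers):
        is_full = (layer + 1) % full_every == 0
        plan.append(
            {
                "layer": layer,
                "attention": "full" if is_full else "swa",
                "positional_encoding": "nope" if is_full else "rope",
            }
        )
    return plan


if __name__ == "__main__":
    # Number of keys visible to each query under the two masks.
    print(causal_mask(6).sum(axis=1))
    print(sliding_window_mask(6, window=3).sum(axis=1))
    for entry in hybrid_layer_plan(num_layers=8, full_every=4):
        print(entry)
```

Under the SWA mask each query sees at most `window` keys, so in this sketch any retrieval beyond that range has to pass through the full-attention layers, which is where the schedule places NoPE.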