The quadratic complexity of self-attention in Transformer-based LLMs makes long-context inference prohibitively expensive. Sliding Window Attention (SWA), the simplest sparse attention pattern, offers a linear-complexity alternative, but it suffers from catastrophic performance collapse on long contexts. This collapse stems from two fundamental factors: the training-inference mismatch that arises when SWA is naively applied to models pretrained with Full Attention (FA), and the inherent structural inability to access distant information when SWA is applied to every module at all times. To address both challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug-and-play toolkit of recipes that adapts FA models to SWA without costly pretraining. SWAA systematically combines four core strategies: (1) FA Decode and (2) interleaving FA and SWA layers, which mitigate the structural defect by selectively allowing access to distant information; and (3) preserving "sink" tokens and (4) lightweight fine-tuning, which mitigate the training-inference mismatch. Our experiments reveal that while each strategy in isolation is insufficient, specific synergistic combinations effectively recover long-context performance. Because the configurations differ in computational overhead, we conduct a performance-efficiency trade-off analysis that identifies optimal SWAA configurations for diverse scenarios, achieving 30% to 100% speedups for long-context inference with acceptable quality retention. Our code, data, and model weights are available at https://github.com/yuyijiong/sliding-window-attention-adaptation
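To make the attention patterns named in the abstract concrete, the following is a minimal illustrative sketch in pure Python, not the released implementation. It builds a boolean mask for causal SWA with optional sink tokens. The function names, the window convention (a query sees itself and the `window - 1` tokens before it), the `prefill_len` switch used to model FA Decode (here read as: prefill queries use SWA, decode queries use full causal attention), and the "every k-th layer" interleaving rule are all assumptions made for illustration.

```python
def swa_mask(seq_len, window, num_sink=0, prefill_len=None):
    """Return mask[q][k] == True when query position q may attend to key k.

    Assumptions (illustrative, not taken from the paper's code):
    - a query sees itself and the `window - 1` tokens before it;
    - the first `num_sink` tokens stay visible to every later query;
    - if `prefill_len` is given, queries at positions >= prefill_len are
      treated as decode steps and use full causal attention ("FA Decode").
    """
    mask = []
    for q in range(seq_len):
        full = prefill_len is not None and q >= prefill_len
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = q - k < window
            is_sink = k < num_sink
            row.append(causal and (full or in_window or is_sink))
        mask.append(row)
    return mask


def layer_uses_full_attention(layer_idx, fa_every):
    """Hypothetical interleaving rule: every `fa_every`-th layer keeps FA."""
    return fa_every > 0 and layer_idx % fa_every == 0


def attended_keys(mask, q):
    """List the key positions visible to query q."""
    return [k for k, visible in enumerate(mask[q]) if visible]


if __name__ == "__main__":
    m = swa_mask(seq_len=8, window=3, num_sink=1)
    print(attended_keys(m, 7))  # [0, 5, 6, 7]: one sink token plus the window
```

Under these assumptions each SWA query attends to at most `window + num_sink` keys regardless of sequence length, which is the source of the linear cost; the FA layers and decode steps are the only places where distant tokens remain reachable.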