The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling full SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to the training-inference mismatch. This raises a natural question: can FA-pretrained LLMs be adapted to SWA without pretraining from scratch? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT) reasoning; and (5) fine-tuning. Our experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios, which can fundamentally accelerate LLM long-context inference by up to 100%. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
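To make recipes (1) and (2) concrete, below is a minimal sketch of the kind of attention mask they imply: causal attention restricted to a sliding window, with the first few "sink" tokens always visible. The function name, window size, and sink count are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def swa_mask_with_sinks(seq_len: int, window: int, num_sink: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) combining causal masking,
    a sliding window, and always-visible sink tokens at the sequence start.
    Illustrative sketch only; parameters are assumptions, not the paper's config."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    causal = k <= q                          # never attend to future tokens
    in_window = (q - k) < window             # keys within the local window
    is_sink = k < num_sink                   # first tokens are always kept
    return causal & (in_window | is_sink)

# Example: 8 tokens, window of 4, 2 sink tokens
print(swa_mask_with_sinks(8, 4, 2).int())
```

Such a mask could, for instance, be applied only during the prefilling pass (recipe 1) or only in a subset of layers (recipe 3), while the remaining layers keep full attention.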