The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to a computation mode. With only 12 hours of training on 8×A800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely used LLMs demonstrate the superiority of our method.
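To make the routing idea concrete, below is a minimal sketch of what a per-head Attention Router could look like at inference time. The pooling choice, weight shapes, and the two-mode set (sparse vs. full) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def route_heads(pooled_input, W_router, b_router, num_heads, num_modes):
    """Hypothetical lightweight router: maps a pooled summary of the input
    to a computation mode per attention head (0 = sparse, 1 = full).
    All names and shapes are assumptions for illustration."""
    # Per-head logits over computation modes: (num_heads, num_modes)
    logits = (pooled_input @ W_router + b_router).reshape(num_heads, num_modes)
    # Hard assignment at inference: each head independently picks a mode,
    # so the model's overall sparsity varies with the input.
    return logits.argmax(axis=-1)

rng = np.random.default_rng(0)
d_model, num_heads, num_modes = 64, 8, 2
pooled = rng.standard_normal(d_model)  # e.g. mean-pooled hidden states
W = rng.standard_normal((d_model, num_heads * num_modes)) * 0.02
b = np.zeros(num_heads * num_modes)

modes = route_heads(pooled, W, b, num_heads, num_modes)
print(modes)  # one mode index per head, e.g. a mix of 0s and 1s
```

During training, such hard assignments would need a differentiable relaxation (e.g. a softmax or straight-through estimator) so gradients can reach the router weights.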