LongHeads: Multi-Head Attention is Secretly a Long Context Processor

Large language models (LLMs) have achieved impressive performance in numerous domains but often struggle to process lengthy inputs effectively and efficiently, owing to limited length generalization and the quadratic computational cost of attention. Many prior methods seek to mitigate this by restricting the attention window to the pre-trained length. However, these methods introduce new issues, such as ignoring the middle context and requiring additional training. To address these problems, we propose LongHeads, a training-free framework that enhances an LLM's long-context ability by unlocking the untapped potential of multi-head attention. Instead of allowing each head to attend to the full sentence, which generalizes poorly to longer sequences because of out-of-distribution (OOD) issues, we allow each head to process an in-distribution length by selecting and attending to important context chunks. To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and key representations and efficiently distributes context chunks to different heads. In this way, each head can effectively process its attended tokens within the trained length, while different heads in different layers collectively process longer contexts. LongHeads works efficiently in linear time and fits seamlessly with many LLMs that use relative positional encoding. LongHeads achieves 100% accuracy at the 128k length on the passkey retrieval task, verifying its efficacy in extending the usable context window of existing models. We release our code at https://github.com/LuLuLuyi/LongHeads .
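
The chunk selection idea described above can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the function name `select_chunks`, the parameters `chunk_size` and `max_tokens`, and the use of a mean-pooled key vector as each chunk's representation are all assumptions made for illustration. The abstract states only that selection relies on the correlation between query and key representations.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product, used here as the query-key correlation score."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Mean-pool a chunk's key vectors (an assumed chunk representation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_chunks(query: Vector, keys: Sequence[Vector],
                  chunk_size: int, max_tokens: int) -> List[int]:
    """Return the sorted indices of the chunks one head attends to.

    The head keeps at most max_tokens // chunk_size chunks, so the number
    of attended tokens stays within the (hypothetical) trained length.
    """
    if chunk_size <= 0 or max_tokens < chunk_size:
        raise ValueError("need chunk_size > 0 and max_tokens >= chunk_size")
    chunks = [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]
    budget = max_tokens // chunk_size
    if len(chunks) <= budget:
        # The whole context already fits within the trained length.
        return list(range(len(chunks)))
    scores = [dot(query, mean_vector(chunk)) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the selected chunks stay in sequence.
    return sorted(ranked[:budget])


# Toy context: 8 tokens with 2-dimensional keys, split into 4 chunks of 2.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [0.0, -1.0]]

# Two heads with different queries pick different chunks under the same budget.
head_a = select_chunks([1.0, 0.5], keys, chunk_size=2, max_tokens=4)
head_b = select_chunks([-1.0, -0.5], keys, chunk_size=2, max_tokens=4)
print(head_a, head_b)
```

In this toy setting each head attends to only 4 of the 8 tokens, yet the two heads together cover every chunk, which mirrors the claim that different heads can collectively process a context longer than any single head sees.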
