The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in improving the efficiency of attention mechanisms, there is still a gap in understanding how attention heads behave in long-context settings. In this paper, we observe that while certain heads consistently attend only to local information, others alternate between local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it is possible to predict which heads are crucial for long-context processing using only local keys. The key idea is to exploit a simple model of the long-context attention scores based on second-moment approximations. These findings reveal simple properties of attention over long sequences and open the door to potentially significant efficiency gains.
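As an illustrative sketch of what such a second-moment model can look like (our notation here, not necessarily the exact formulation used later): suppose a head's long-context keys $k$ are summarized only by their empirical mean $\mu$ and covariance $\Sigma$. Then, for a query $q$ and head dimension $d$, the expected unnormalized softmax weight of a long-context key is approximately
\[
\mathbb{E}\!\left[\exp\!\left(\frac{q^\top k}{\sqrt{d}}\right)\right]
\;\approx\;
\exp\!\left(\frac{q^\top \mu}{\sqrt{d}} \;+\; \frac{q^\top \Sigma\, q}{2d}\right),
\]
which is exact when the keys are Gaussian and a second-order approximation otherwise. Multiplying by the number of long-context keys gives a rough estimate of the softmax mass they would receive, which can be compared against the scores of the local keys to decide, per query, whether a given head needs long-context information.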