When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93\% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.
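For concreteness, the sketch below illustrates one way such per-head, per-token routing could be wired into a standard attention layer. It is a minimal PyTorch illustration under our own naming assumptions (\texttt{AHAAttention}, \texttt{router}, \texttt{window}), not the paper's actual implementation; the binary router is approximated here by a hard threshold on a learned per-head gate, whereas training such a gate end to end would additionally require a differentiable surrogate (e.g.\ a straight-through estimator).

\begin{verbatim}
# Minimal sketch (not the authors' code) of per-head, per-token routing
# between full causal attention and sliding-window attention.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def sliding_window_mask(seq_len, window, device=None):
    """Causal mask restricted to the last `window` tokens (True = keep)."""
    idx = torch.arange(seq_len, device=device)
    rel = idx[None, :] - idx[:, None]      # key index minus query index
    return (rel <= 0) & (rel > -window)    # causal and within the window


def causal_mask(seq_len, device=None):
    idx = torch.arange(seq_len, device=device)
    return idx[None, :] <= idx[:, None]


class AHAAttention(nn.Module):
    """Per-head binary routing: full attention vs. local window."""

    def __init__(self, d_model, n_heads, window=256):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.window = window
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One scalar gate per head, computed from each token's hidden state.
        self.router = nn.Linear(d_model, n_heads)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q, k, v = [z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v)]

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (b,h,t,t)
        local = sliding_window_mask(t, self.window, x.device)      # (t,t)
        full = causal_mask(t, x.device)                            # (t,t)

        # Hard binary decision per token and head:
        # True -> full attention, False -> sliding window only.
        use_full = (self.router(x) > 0).transpose(1, 2).unsqueeze(-1)  # (b,h,t,1)
        mask = torch.where(use_full, full, local)                      # (b,h,t,t)

        scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)


if __name__ == "__main__":
    layer = AHAAttention(d_model=64, n_heads=4, window=8)
    x = torch.randn(2, 32, 64)
    print(layer(x).shape)  # torch.Size([2, 32, 64])
\end{verbatim}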