In this paper, we demonstrate that an inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks that demand a high degree of context awareness, such as tool use. Specifically, crucial information in the context may be overlooked by the model when it falls in a trough of the attention waveform, leading to decreased performance. To address this issue, we propose a novel inference method named Attention Buckets. It lets an LLM process its input through multiple parallel processes, each using a distinct base angle for the rotary position embedding and thereby producing a unique attention waveform. By compensating for an attention trough in one process with an attention peak in another, our approach enhances the LLM's awareness of various contextual positions, mitigating the risk of overlooking crucial information. On the largest tool-use benchmark, our method elevates a 7B model to state-of-the-art performance, comparable to that of GPT-4. On other benchmarks and some retrieval-augmented generation (RAG) tasks, which also demand a thorough understanding of contextual content, Attention Buckets also yields notable performance improvements.
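The waveform idea can be illustrated with a small sketch. It is not the paper's implementation: it assumes the standard rotary frequencies theta_i = base^(-2i/d), uses identical all-ones query and key vectors so that the pre-softmax score depends only on relative distance, and combines the parallel processes with a pointwise maximum purely for illustration. The head dimension, distance range, and the three base values are hypothetical choices, and all function names are made up for this example.

```python
import math


def rope_frequencies(base: float, head_dim: int) -> list[float]:
    """Per-pair rotation frequencies theta_i = base^(-2i/head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def attention_waveform(base: float, head_dim: int, max_distance: int) -> list[float]:
    """Pre-softmax RoPE score between all-ones query and key vectors.

    For q = k = (1, ..., 1), the rotated inner product at relative
    distance d reduces to 2 * sum_i cos(d * theta_i).
    """
    thetas = rope_frequencies(base, head_dim)
    return [2.0 * sum(math.cos(d * t) for t in thetas) for d in range(max_distance)]


def trough_positions(wave: list[float]) -> list[int]:
    """Relative distances at which the waveform has a strict local minimum."""
    return [
        d
        for d in range(1, len(wave) - 1)
        if wave[d] < wave[d - 1] and wave[d] < wave[d + 1]
    ]


# Illustrative settings (assumptions, not values taken from the paper).
HEAD_DIM = 128
MAX_DISTANCE = 512
BASES = [10000.0, 17500.0, 25000.0]

# One waveform per parallel process, each with its own RoPE base.
waves = {base: attention_waveform(base, HEAD_DIM, MAX_DISTANCE) for base in BASES}

# Illustrative combination: at every distance keep the strongest score,
# so a trough under one base can be covered by a peak under another.
combined = [max(wave[d] for wave in waves.values()) for d in range(MAX_DISTANCE)]

for base, wave in waves.items():
    print(f"base={base:>8.0f}  first troughs at distances {trough_positions(wave)[:5]}")
```

Because each base shifts where the local minima fall, the combined envelope is never lower than any single waveform at any distance. How the outputs of the parallel processes are actually aggregated is not specified in the abstract, so the maximum used here should be read only as a stand-in.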