Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context-window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query's information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs. In this paper, we propose a method for processing long-context information-seeking tasks via query-guided Activation Refilling (ACRE). ACRE constructs a bi-layer KV cache for long contexts, where the layer-1 (L1) cache compactly captures global information and the layer-2 (L2) cache provides detailed, localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thus improving answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE's effectiveness, achieving improvements in both performance and efficiency.
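The refilling mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the entry structure, the dot-product scoring, and the `top_k` selection are all simplifying assumptions standing in for the model's actual attention over real KV activations. It only shows the bi-layer idea, where each compact L1 entry proxies a span of detailed L2 entries, and a query swaps the L2 details of its top-scoring L1 entries into the working cache.

```python
# Illustrative sketch of query-guided activation refilling (hypothetical
# data structures, not the paper's implementation).
from dataclasses import dataclass


@dataclass
class L1Entry:
    key: list       # compact vector summarizing one context span (assumed stand-in for L1 KV)
    l2_span: list   # the detailed L2 entries this L1 entry proxies


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def refill(l1_cache, query, top_k=1):
    """Score each L1 entry against the query, then build a working cache:
    the top-k entries are replaced by their detailed L2 spans (refilling),
    while the rest stay as compact global entries."""
    ranked = sorted(range(len(l1_cache)),
                    key=lambda i: dot(query, l1_cache[i].key),
                    reverse=True)
    selected = set(ranked[:top_k])
    working = []
    for i, entry in enumerate(l1_cache):
        if i in selected:
            working.extend(entry.l2_span)   # query-relevant local details
        else:
            working.append(entry.key)       # compact global context
    return working


# Two context spans; the query aligns with the first span's summary,
# so only that span's L2 details are refilled into the working cache.
cache = [
    L1Entry(key=[1.0, 0.0], l2_span=["detail-A1", "detail-A2"]),
    L1Entry(key=[0.0, 1.0], l2_span=["detail-B1"]),
]
working_cache = refill(cache, query=[0.9, 0.1], top_k=1)
```

The working cache thus mixes granularities: fine-grained entries only where the query needs them, compact summaries elsewhere, which is what keeps decoding cheaper than attending to the full L2 cache.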