Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context-window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query's information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs. In this paper, we propose a method for processing long-context information-seeking tasks via query-guided Activation Refilling (ACRE). ACRE constructs a bi-layer KV cache for long contexts, where the layer-1 (L1) cache compactly captures global information and the layer-2 (L2) cache provides detailed, localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thereby improving answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE's effectiveness, with improvements in both performance and efficiency.
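The query-guided refilling described above can be illustrated with a minimal sketch. This is not the paper's implementation: the cache layout, the dot-product scoring, and all names (`refill`, `l1_keys`, `l2_keys`, `top_k`, `span`) are illustrative assumptions. It shows only the core idea that each compact L1 entry proxies a group of detailed L2 entries, and entries most relevant to the query are expanded in place.

```python
# Hypothetical sketch of ACRE-style query-guided refilling; all names and the
# scoring function are illustrative assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
d = 8       # head dimension
n_l1 = 4    # compact L1 entries (global summary)
span = 3    # each L1 entry proxies `span` detailed L2 entries

# Bi-layer KV cache: L1 is compact; L2 holds the full, localized detail,
# grouped by the L1 entry that proxies it.
l1_keys = rng.normal(size=(n_l1, d))
l2_keys = rng.normal(size=(n_l1, span, d))

def refill(query, top_k=2):
    """Attend the query to L1, then refill the top-k slots from L2."""
    scores = l1_keys @ query             # query-to-L1 attention logits
    top = set(np.argsort(scores)[-top_k:])  # most query-relevant L1 entries
    refilled = []
    for i in range(n_l1):
        if i in top:
            refilled.extend(l2_keys[i])  # expand into detailed L2 entries
        else:
            refilled.append(l1_keys[i])  # keep the compact L1 entry
    return np.stack(refilled)

cache = refill(rng.normal(size=d), top_k=2)
# 2 slots stay compact, 2 expand into 3 L2 entries each: 2 + 2*3 = 8 entries
print(cache.shape)  # (8, 8)
```

The resulting cache mixes global (L1) and local (L2) entries, so its size grows with the query's information needs rather than with the full context length.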