As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to process arbitrarily long sequences efficiently and affordably. One effective approach is to use a FIFO memory that stores the keys and values of an attention sublayer from past chunks so that subsequent queries can attend to them. However, this approach requires a large memory and/or must take the specific LM architecture into consideration. Moreover, because of the causal relationship between the key-values in the prior context and the queries at present, this approach cannot be extended to bidirectional attention, such as in an encoder-decoder or PrefixLM decoder-only architecture. In this paper, we propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures. We also propose the Attendre layer, a wait-to-attend mechanism that retrieves from the key-value memory (K/V memory) using evicted queries from the query memory (Q memory). As a first step, we evaluate this method in the context length extension setup using the TriviaQA reading comprehension task and show the effectiveness of the approach.
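To make the FIFO baseline mentioned above concrete, the following is a minimal sketch of a fixed-capacity key-value memory in which the oldest entries are dropped first and a later query attends over whatever remains. It illustrates only the FIFO memory idea, not the proposed eviction policies or the Attendre layer; the class name `FifoKVMemory`, its methods, and the toy vectors are hypothetical and chosen for illustration.

```python
import math
from collections import deque


class FifoKVMemory:
    """Hypothetical fixed-capacity FIFO store of (key, value) pairs from past chunks."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest entry when a new one is
        # appended to a full buffer, which gives FIFO eviction.
        self.entries = deque(maxlen=capacity)

    def write(self, keys, values):
        """Append the keys and values produced by one chunk."""
        for key, value in zip(keys, values):
            self.entries.append((key, value))

    def attend(self, query):
        """Scaled dot-product softmax attention of one query over the memory."""
        if not self.entries:
            return None
        scale = 1.0 / math.sqrt(len(query))
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key, _ in self.entries
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.entries[0][1])
        return [
            sum(w * value[i] for w, (_, value) in zip(weights, self.entries)) / total
            for i in range(dim)
        ]


memory = FifoKVMemory(capacity=2)
# Chunk 1 writes one key-value pair.
memory.write(keys=[[1.0, 0.0]], values=[[10.0, 0.0]])
# Chunk 2 writes two pairs, which pushes the chunk 1 pair out of the memory.
memory.write(keys=[[0.0, 1.0], [1.0, 1.0]], values=[[0.0, 10.0], [5.0, 5.0]])
# A later query can only attend to the two surviving pairs.
output = memory.attend([0.0, 0.0])
```

Because the capacity is fixed, the memory cost is bounded, but any key-value pair older than the buffer length is lost regardless of how useful it was. That limitation is the motivation the abstract gives for considering other eviction policies.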