In-context learning (ICL) capabilities are foundational to the success of large language models (LLMs). Recently, context compression has attracted growing interest, since it can greatly reduce the reasoning complexity and computation cost of LLMs. In this paper, we introduce a novel Query-gUIded aTtention cOmpression (QUITO) method, which leverages the attention of the question over the context to filter out useless information. Specifically, we use a trigger token to calculate the attention distribution of the context in response to the question. Based on this distribution, we propose three different filtering methods to satisfy the budget constraints of the context length. We evaluate QUITO on two widely used datasets, namely NaturalQuestions and ASQA. Experimental results demonstrate that QUITO significantly outperforms established baselines across datasets and downstream LLMs, underscoring its effectiveness. Our code is available at https://github.com/Wenshansilvia/attention_compressor.
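The core filtering step described above can be illustrated with a minimal sketch. Here, `attn_scores` stands in for the attention that the trigger token pays to each context token; the actual method derives these scores from an LLM's attention maps, and the function name and toy data below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def compress_context(tokens, attn_scores, budget):
    """Keep the `budget` highest-attention tokens, preserving original order.

    Simplified sketch of query-guided filtering: tokens the question attends
    to most are retained; the rest are dropped to meet the length budget.
    """
    assert len(tokens) == len(attn_scores)
    if budget >= len(tokens):
        return list(tokens)
    # Indices of the top-`budget` scores, re-sorted to keep word order.
    keep = np.argsort(attn_scores)[-budget:]
    keep.sort()
    return [tokens[i] for i in keep]

# Toy example with made-up scores (higher = more attended by the question).
tokens = ["The", "Eiffel", "Tower", "is", "located", "in", "Paris", "France"]
scores = np.array([0.01, 0.30, 0.25, 0.02, 0.05, 0.03, 0.40, 0.20])
print(compress_context(tokens, scores, budget=4))
# → ['Eiffel', 'Tower', 'Paris', 'France']
```

Real variants would operate on sentence or chunk granularity rather than single tokens, which is one way the paper's three filtering methods could differ under the same budget constraint.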