Because the attention module scales quadratically with sequence length, it gradually comes to dominate compute in Transformer-based LLMs during generation. Moreover, the large key-value cache that accumulates when processing long inputs severely increases memory footprint and inference latency. In this work, we propose a plug-and-play approach that incrementally compresses the intermediate activations of a specified span of tokens into a more compact set, thereby reducing both memory and computational cost when processing subsequent context. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach over sparse attention baselines in terms of fluency, n-gram matching, and semantic similarity. Finally, we comprehensively profile the benefit of context compression for system throughput. Code is available at https://github.com/DRSY/KV_Compression.
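To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch, not the compression method itself. It only counts how many key-value entries remain in the cache when a span of tokens is replaced by a smaller number of compact entries. The function names, the model shape (32 layers, 32 heads, head dimension 128, 16-bit values), and the span and compact sizes are all illustrative assumptions and are not taken from this work.

```python
def kv_cache_bytes(num_tokens, num_layers, num_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_layers * num_heads * head_dim * num_tokens * bytes_per_value


def cached_tokens_after_compression(num_tokens, span_start, span_end, num_compact):
    """Cache length once tokens in [span_start, span_end) become `num_compact` entries."""
    if not 0 <= span_start <= span_end <= num_tokens:
        raise ValueError("span must lie inside the cached context")
    span_len = span_end - span_start
    if not 0 <= num_compact <= span_len:
        raise ValueError("compact size must not exceed the span length")
    return num_tokens - span_len + num_compact


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=32, num_heads=32, head_dim=128)

full_tokens = 4096
kept_tokens = cached_tokens_after_compression(
    full_tokens, span_start=0, span_end=3072, num_compact=64
)

full_bytes = kv_cache_bytes(full_tokens, **cfg)
compressed_bytes = kv_cache_bytes(kept_tokens, **cfg)

print(f"cached entries: {full_tokens} -> {kept_tokens}")
print(f"full cache:       {full_bytes / 2**30:.2f} GiB")        # 2.00 GiB
print(f"compressed cache: {compressed_bytes / 2**30:.2f} GiB")  # 0.53 GiB
```

During generation each new token attends to every cached entry, so under these assumptions the per-step attention cost shrinks by the same factor as the cache (here 1088 / 4096, roughly 27% of the original).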