Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations, stored in the so-called KV cache, account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models, and it enables the compression of arbitrary spans of a context while preserving pre-trained model capabilities. We treat the compressed and uncompressed caches as a student-teacher pair and apply a KL-type divergence to match their generated outputs. KV-Distill outperforms other compression techniques on worst-case extractive tasks and approaches uncompressed performance in long-context question answering and summarization; it can also be fine-tuned on domain-specific contexts to reduce context lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
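The matching step described above can be pictured as a standard distillation loss between the output distributions produced with the full cache (teacher) and the compressed cache (student). The snippet below is a minimal sketch only, assuming a PyTorch setting; the KL direction, temperature, and reduction are illustrative assumptions, not details given in the abstract.

```python
import torch
import torch.nn.functional as F

def kv_distill_loss(teacher_logits: torch.Tensor,
                    student_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """KL-type divergence between teacher and student output distributions.

    teacher_logits: logits computed with the full (uncompressed) KV cache,
                    shape (batch, seq_len, vocab_size).
    student_logits: logits computed with the compressed KV cache, same shape.
    The forward-KL direction and temperature scaling here are assumptions
    for illustration, not the paper's exact objective.
    """
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    s = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div takes the student's log-probs as input and, with log_target=True,
    # the teacher's log-probs as target; batchmean averages over the batch.
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2
```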