Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to a 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H$_2$O, TOVA). GQA and DMC can even be combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.
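The linear growth of the KV cache described above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes Llama 2 7B's public configuration (32 layers, 32 attention heads, head dimension 128) and fp16 storage, and the function name is hypothetical, not part of DMC.

```python
# Sketch of the KV cache memory footprint that motivates DMC.
# Assumed Llama 2 7B shape: 32 layers, 32 KV heads, head_dim 128;
# fp16 storage means 2 bytes per element.

def kv_cache_bytes(batch, seq_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys AND values (hence the factor of 2)
    for every token, head, and layer."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(batch=8, seq_len=4096)
compressed = full // 4  # 4x DMC compression, per the abstract's setting
print(f"full KV cache:     {full / 2**30:.1f} GiB")   # 16.0 GiB
print(f"with 4x DMC:       {compressed / 2**30:.1f} GiB")  # 4.0 GiB
```

At batch 8 and a 4096-token context, the cache alone occupies 16 GiB, which is comparable to the model weights themselves; compressing it 4x frees memory that can instead hold longer contexts or larger batches.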