Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient because a cache of key-value representations for past tokens must be kept in memory, and its size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to ~3.7x throughput increase in auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget.
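To make the linear scaling of the key-value cache concrete, the following is a minimal back-of-the-envelope sketch, not the DMC implementation. The function name `kv_cache_bytes` and its parameters are hypothetical. The example values (32 layers, 32 key-value heads, head dimension 128, 16-bit storage) are assumptions chosen to resemble a 7B-scale model with standard multi-head attention. The single uniform `compression_ratio` is a simplification: DMC learns different compression rates per head and layer, so this models only an average rate.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, compression_ratio=1.0):
    """Estimate key-value cache size in bytes.

    Assumption: every layer and head stores one key vector and one value
    vector per token, and compression is modeled as a single average ratio.
    """
    # Factor 2: one key and one value per token, per head, per layer.
    uncompressed = (2 * batch_size * seq_len * n_layers
                    * n_kv_heads * head_dim * bytes_per_elem)
    return uncompressed / compression_ratio


# Assumed 7B-scale configuration (illustrative values, not from the abstract).
config = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2)

baseline = kv_cache_bytes(batch_size=1, seq_len=4096, **config)
compressed = kv_cache_bytes(batch_size=1, seq_len=4096,
                            compression_ratio=4.0, **config)

print(f"baseline:   {baseline / 2**30:.2f} GiB")
print(f"compressed: {compressed / 2**30:.2f} GiB")
```

Under these assumptions the uncompressed cache for a single 4096-token sequence is 2 GiB, and a 4x average compression reduces it to 0.5 GiB. Doubling either the batch size or the sequence length doubles both figures, which is why compression translates directly into longer contexts or larger batches within a fixed memory budget.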