Transformers have become the foundational architecture for Large Language Models (LLMs). In generative language models, inference involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) cache. This phase is constrained by memory bandwidth because weights and KV cache values must be transferred from the memory system to the computing units. The memory bottleneck becomes particularly pronounced in applications that require long contexts and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an inference-time approach that mitigates the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer identifies these key tokens using a novel score function and retains only them in the KV cache. This approach reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer across three foundational models, GPT-J, Cerebras-GPT, and MPT, which employ different positional embedding algorithms. Our assessment covers a variety of tasks, with particular emphasis on summarization and conversation tasks involving extended contexts. By shrinking the KV cache, Keyformer reduces inference latency by 2.1x and improves token generation throughput by 2.4x while preserving the model's accuracy.
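The idea of keeping only high-attention tokens in the KV cache can be illustrated with a minimal sketch. This is not Keyformer's actual algorithm: the abstract does not specify the score function, so plain softmax attention weights from a single query stand in for it here. The function names `tokens_covering_mass` and `prune_kv_cache`, the single-head cache layout, and the random data are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def tokens_covering_mass(attn, mass=0.9):
    """Smallest number of tokens whose attention weights sum to at least `mass`."""
    cumulative = np.cumsum(np.sort(attn)[::-1])  # largest weights first
    count = int(np.searchsorted(cumulative, mass)) + 1
    return min(count, len(attn))  # guard against floating-point shortfall


def prune_kv_cache(keys, values, scores, budget):
    """Keep the `budget` highest-scoring tokens, preserving their original order.

    `scores` is a stand-in for a token-importance score (hypothetical here).
    """
    if budget >= len(scores):
        keep = np.arange(len(scores))
    else:
        top = np.argpartition(scores, -budget)[-budget:]  # unordered top-`budget`
        keep = np.sort(top)
    return keys[keep], values[keep], keep


# Toy single-head cache: 64 cached tokens, head dimension 16 (illustrative sizes).
rng = np.random.default_rng(0)
seq_len, head_dim = 64, 16
keys = rng.normal(size=(seq_len, head_dim))
values = rng.normal(size=(seq_len, head_dim))
query = rng.normal(size=head_dim)

# Attention of the current query over all cached tokens.
attn = softmax(keys @ query / np.sqrt(head_dim))

# How many tokens are needed to cover 90% of the attention weight?
budget = tokens_covering_mass(attn, mass=0.9)
pruned_keys, pruned_values, kept = prune_kv_cache(keys, values, attn, budget)

print(f"kept {budget} of {seq_len} tokens, "
      f"covering {attn[kept].sum():.3f} of the attention weight")
```

On random data like this the attention is fairly diffuse, so the reduction is modest; the abstract's claim is that in real generative inference the weight concentrates on a much smaller subset. A practical scheme would also need to score tokens across decoding steps, heads, and layers, which this single-query sketch ignores.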