This paper introduces the Bounded-Cache Transformer (BCT), a novel approach to building large language models with a predefined Key-Value (KV) cache capacity. The BCT addresses the excessive memory consumption of traditional KV caches by bounding the cache length in the attention layers of decoder-only Transformer architectures. By dynamically updating the cached key and value vector sequences, the BCT performs inference within a fixed cache budget. Experimental results demonstrate that the BCT significantly reduces memory usage while preserving the model's inference quality and system throughput, offering a new solution for efficient inference in large language models.
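To make the bounded-cache idea concrete, the following is a minimal sketch of a fixed-capacity KV cache with a sliding-window eviction policy. The class name, API, and eviction rule are illustrative assumptions; the abstract does not specify the BCT's actual update mechanism.

```python
import numpy as np

class BoundedKVCache:
    """Hypothetical sketch of a fixed-capacity KV cache.

    When the cache is full, the oldest key/value vectors are evicted
    (a simple sliding-window policy), so memory stays bounded at
    `capacity` entries regardless of sequence length.
    """

    def __init__(self, capacity: int, head_dim: int):
        self.capacity = capacity
        self.keys = np.zeros((0, head_dim))
        self.values = np.zeros((0, head_dim))

    def update(self, k: np.ndarray, v: np.ndarray) -> None:
        # Append the new key/value vectors, then trim to capacity,
        # keeping only the most recent entries.
        self.keys = np.concatenate([self.keys, k])[-self.capacity:]
        self.values = np.concatenate([self.values, v])[-self.capacity:]

    def attend(self, q: np.ndarray) -> np.ndarray:
        # Standard scaled dot-product attention over the bounded cache.
        d = q.shape[-1]
        scores = q @ self.keys.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ self.values

# Usage: decode six tokens with a capacity of four; the cache never
# grows past four entries.
cache = BoundedKVCache(capacity=4, head_dim=8)
for step in range(6):
    kv = np.full((1, 8), float(step))
    cache.update(kv, kv)
out = cache.attend(np.ones((1, 8)))
```

A sliding window is only the simplest possible eviction policy; a scheme like the BCT could instead select which entries to keep based on learned or attention-derived importance, which is where the "dynamic update" of the cached sequences would come in.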