The emergence of long-context text applications built on large language models (LLMs) has introduced significant scalability challenges, particularly in memory footprint. The Key-Value (KV) cache, which stores attention keys and values to avoid redundant computation, grows linearly with context length; this can drive memory consumption high enough that models fail to serve requests under limited memory resources. To address this issue, we propose Cache Sparse Representation (CSR), a novel approach that transforms the dense KV cache tensor into sparse indexes and weights, yielding a more memory-efficient representation during LLM inference. Furthermore, we introduce NeuralDict, a novel neural-network-based method for automatically generating the dictionary used in our sparse representation. Extensive experiments demonstrate that CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms while remaining robust in memory-constrained environments.
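The core idea of a dictionary-based sparse representation, replacing each dense cache vector with a handful of dictionary indexes and weights, can be sketched with greedy matching pursuit. This is a minimal illustration only: the random unit-norm dictionary, the sparsity level `s`, and the function names are assumptions for the sketch, not the paper's NeuralDict construction.

```python
import numpy as np

def sparse_encode(v, D, s):
    """Greedily approximate v as a combination of s dictionary atoms
    (matching pursuit). Returns the chosen atom indexes and weights."""
    residual = v.astype(np.float64).copy()
    indexes, weights = [], []
    for _ in range(s):
        scores = D @ residual          # correlation of each atom with the residual
        i = int(np.argmax(np.abs(scores)))
        w = scores[i]                  # atoms are unit-norm, so this is the projection
        residual -= w * D[i]
        indexes.append(i)
        weights.append(w)
    return np.array(indexes), np.array(weights)

def sparse_decode(indexes, weights, D):
    """Rebuild a dense vector from its sparse indexes and weights."""
    return weights @ D[indexes]

rng = np.random.default_rng(0)
d, n_atoms, s = 64, 512, 8             # head dim, dictionary size, sparsity (illustrative)
D = rng.standard_normal((n_atoms, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm atoms

key = rng.standard_normal(d)           # stands in for one cached key vector
idx, w = sparse_encode(key, D, s)
approx = sparse_decode(idx, w, D)
```

Storing `s` (index, weight) pairs per vector instead of `d` full-precision floats is where the memory saving comes from; the quality of the dictionary (which NeuralDict learns rather than drawing at random) determines how small `s` can be for a given reconstruction error.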