Large language models (LLMs) represent a groundbreaking advancement in the domain of natural language processing due to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths for these models to enhance their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space, thereby reducing the KV cache memory overhead. Our proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments over OPT, MPT, and Llama model families, we demonstrate that Eigen Attention results in up to 40% reduction in KV cache sizes and up to 60% reduction in attention operation latency with minimal drop in performance. Code is available at https://github.com/UtkarshSaxena1/EigenAttn.
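To make the idea of attention in a low-rank space concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a rank-r orthonormal basis for the keys and another for the values are obtained from an eigendecomposition over calibration samples, and that only the projected keys and values are cached. The function names, the synthetic data, and the choice of bases are all hypothetical; the abstract does not specify how Eigen Attention constructs its low-rank space.

```python
import numpy as np


def top_eigvecs(X, r):
    """Return a (d, r) orthonormal basis: the r leading eigenvectors of X^T X."""
    _, vecs = np.linalg.eigh(X.T @ X)  # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :r]        # reverse the columns, keep the top r


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def full_attention(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


def low_rank_attention(Q, K_low, V_low, U_k, U_v):
    """Attention computed against cached r-dimensional keys and values."""
    d = Q.shape[-1]                             # keep the original 1/sqrt(d) scaling
    scores = (Q @ U_k) @ K_low.T / np.sqrt(d)   # equals Q U_k U_k^T K^T / sqrt(d)
    return (softmax(scores) @ V_low) @ U_v.T    # map the output back to d dimensions


def demo(n=128, d=64, r=16, seed=0):
    """Compare full and low-rank attention on synthetic, exactly rank-r data."""
    rng = np.random.default_rng(seed)
    # Assumption: keys and values each lie in an r-dimensional subspace of R^d.
    B_k = rng.normal(size=(r, d)) / np.sqrt(r)
    B_v = rng.normal(size=(r, d)) / np.sqrt(r)

    def sample(B, m):
        return rng.normal(size=(m, r)) @ B

    # Bases estimated from separate (hypothetical) calibration samples.
    U_k = top_eigvecs(sample(B_k, 256), r)
    U_v = top_eigvecs(sample(B_v, 256), r)

    Q = rng.normal(size=(n, d))
    K, V = sample(B_k, n), sample(B_v, n)
    K_low, V_low = K @ U_k, V @ U_v  # what would be stored in the KV cache

    out_full = full_attention(Q, K, V)
    out_low = low_rank_attention(Q, K_low, V_low, U_k, U_v)
    err = np.abs(out_full - out_low).max()
    ratio = (K_low.size + V_low.size) / (K.size + V.size)
    return err, ratio


if __name__ == "__main__":
    err, ratio = demo()
    print(f"max abs error: {err:.2e}, cache size ratio: {ratio}")
```

In this toy setting the keys and values are exactly rank r, so the low-rank result matches full attention up to floating-point error while the cache shrinks by a factor of r/d. Real activations are only approximately low rank, so truncating to r dimensions trades some accuracy for memory.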