As the length of input text grows, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. We introduce Low Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes the full-precision query and key matrices into compact rank-\(r\) factors during the prefill stage, and then uses these low-dimensional projections to compute proxy attention scores in \(\mathcal{O}(lr)\) time at each decode step. By selecting only the top-\(k\) tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only the missing full-precision KV pairs, thereby preserving exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long-context settings while delivering significant memory savings with minimal loss in accuracy. Our code is available at https://github.com/tenghuilee/LRQK.
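To make the two-stage idea concrete, the sketch below illustrates one plausible reading of the abstract: keys are factored into rank-\(r\) components at prefill, a rank-\(r\) proxy score over all \(l\) cached tokens is computed at each decode step, and exact attention is then run only over the selected top-\(k\) plus recent tokens. This is a minimal illustration, not the authors' implementation (which is at https://github.com/tenghuilee/LRQK); the function names, the SVD-based factorization, and the simplification of projecting the query through the shared key factor are all assumptions, and the GPU-CPU cache bookkeeping is omitted.

```python
# Hypothetical sketch of the proxy-scoring + top-k selection idea from the
# abstract. Names (rank_r_factors, decode_step) and the use of a truncated
# SVD are illustrative assumptions, not the LRQK reference implementation.
import torch

def rank_r_factors(K: torch.Tensor, r: int):
    """Prefill: factor the full-precision key matrix K (l x d) into
    compact rank-r factors K ~= A @ B, with A (l, r) and B (r, d)."""
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    A = U[:, :r] * S[:r]   # (l, r) low-dimensional key codes
    B = Vh[:r, :]          # (r, d) shared projection
    return A, B

def decode_step(q, K_full, V_full, A, B, k=32, recent=16):
    """One decode step: score all l cached tokens with the rank-r proxy
    in O(l*r), keep the top-k plus a fixed recent window, then run exact
    attention on the gathered full-precision KV pairs only."""
    l, d = K_full.shape
    q_low = B @ q                    # (r,) project the query once
    proxy = A @ q_low                # (l,) proxy scores, O(l*r)

    topk_idx = torch.topk(proxy, min(k, l)).indices
    recent_idx = torch.arange(max(l - recent, 0), l)
    idx = torch.unique(torch.cat([topk_idx, recent_idx]))

    # In LRQK these full-precision pairs live in a mixed GPU-CPU cache;
    # only indices missing on the GPU would be transferred (hit-and-miss).
    K_sel, V_sel = K_full[idx], V_full[idx]

    scores = (K_sel @ q) / d ** 0.5
    return torch.softmax(scores, dim=0) @ V_sel

# Tiny usage example with random tensors.
l, d, r = 1024, 64, 8
K_full, V_full = torch.randn(l, d), torch.randn(l, d)
A, B = rank_r_factors(K_full, r)
out = decode_step(torch.randn(d), K_full, V_full, A, B)
```

Because the selected tokens' keys and values are used at full precision, the attention output over that subset is exact; the low-rank factors are only used to decide which tokens to fetch.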


