Transformer pretraining is increasingly constrained by memory and compute requirements, with the key-value (KV) cache emerging as a dominant bottleneck during training and autoregressive decoding. We propose \textit{low-rank KV adaptation} (LRKV), a simple modification of multi-head attention that reduces KV cache memory by exploiting redundancy across attention heads while preserving full token-level resolution. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, yielding a continuous trade-off between complete sharing and fully independent attention. LRKV is a drop-in replacement for standard multi-head attention and directly subsumes KV-sharing approaches such as multi-query attention (MQA) and grouped-query attention (GQA), while remaining distinct from latent-compression methods such as multi-head latent attention (MLA). Across large-scale pretraining experiments, LRKV consistently achieves faster loss reduction, lower validation perplexity, and stronger downstream task performance than standard attention, MQA/GQA, and MLA. At the 2.5B scale, LRKV outperforms standard attention while using roughly half the KV cache, and reaches equivalent model quality with up to \textbf{20--25\% less training compute} when measured in cumulative FLOPs. To explain these gains, we analyze attention head structure in operator space and show that LRKV preserves nearly all functional head diversity relative to standard attention, whereas more aggressive KV-sharing mechanisms rely on compensatory query specialization. Together, these results establish LRKV as a practical and effective attention mechanism for scaling Transformer pretraining under memory- and compute-constrained regimes.
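The mechanism described above, a shared full-rank KV projection augmented with low-rank, head-specific residuals, can be sketched as follows. This is a minimal illustration only: the function name, shapes, and rank \texttt{r} are assumptions for exposition, not the paper's exact parameterization.

```python
import numpy as np

def lrkv_kv(x, W_shared, A, B):
    """Per-head keys (or values) under an LRKV-style parameterization.

    Each head's projection is a shared full-rank matrix plus a
    head-specific low-rank residual A_h @ B_h with rank r << d_head.

    x:        (seq, d_model)          input token representations
    W_shared: (d_model, d_head)       full-rank projection, shared by all heads
    A:        (n_heads, d_model, r)   low-rank residual factors (assumed names)
    B:        (n_heads, r, d_head)
    returns:  (n_heads, seq, d_head)
    """
    # Shared component: computed and cached once for all heads.
    shared = x @ W_shared                              # (seq, d_head)
    # Head-specific component: rank-r residual per head.
    residual = np.einsum('sd,hdr,hre->hse', x, A, B)   # (n_heads, seq, d_head)
    return shared[None, :, :] + residual
```

Under this sketch, the cache need only hold the shared \texttt{(seq, d\_head)} projection plus the rank-\texttt{r} coefficients \texttt{x @ A\_h} per head, rather than a full \texttt{(seq, d\_head)} tensor per head, which is where the memory reduction comes from. Setting \texttt{r = 0} recovers complete sharing (MQA-style), while growing \texttt{r} toward \texttt{d\_head} approaches fully independent heads, matching the continuous trade-off claimed in the abstract.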