CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

From arXiv. Accepted at ICLR 2026 (conference paper). 10 pages of main text; 34 pages in total including references and appendix; 11 figures and 20 tables.

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model's accuracy.
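To make the contrast between weight-only and activation-preserving factorization concrete, here is a minimal NumPy sketch. It is not the CARE implementation: it assumes a generic covariance-whitened truncated SVD, in which the rank-r factors are chosen to minimize the output error ||XW - XAB||_F on calibration activations X rather than the weight error ||W - AB||_F. The function names, the `eps` regularizer, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def plain_svd_lowrank(W, r):
    """Weight-only baseline: best rank-r approximation of W in Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def covariance_aware_lowrank(W, X, r, eps=1e-6):
    """Hypothetical activation-aware factorization (a sketch, not the paper's code).

    Returns A (d x r) and B (r x m) minimizing ||X W - X A B||_F, up to the
    small ridge term eps that keeps the covariance invertible.
    """
    n, d = X.shape
    C = X.T @ X / n + eps * np.eye(d)      # regularized activation covariance
    L = np.linalg.cholesky(C)              # C = L @ L.T
    S = L.T                                # ||X D||_F^2 is approx. n * ||S D||_F^2
    U, s, Vt = np.linalg.svd(S @ W, full_matrices=False)
    A = np.linalg.solve(S, U[:, :r] * s[:r])   # undo the whitening on the left factor
    B = Vt[:r]
    return A, B

# Synthetic, strongly anisotropic activations (assumed for illustration only).
rng = np.random.default_rng(0)
d, m, n, r = 32, 16, 512, 4
X = rng.standard_normal((n, d)) * np.logspace(0, -2, d)
W = rng.standard_normal((d, m))

W_plain = plain_svd_lowrank(W, r)
A, B = covariance_aware_lowrank(W, X, r)
W_cov = A @ B

# Compare output (activation) error, which is the quantity that matters at inference.
err_plain = np.linalg.norm(X @ W - X @ W_plain)
err_cov = np.linalg.norm(X @ W - X @ W_cov)
print(f"activation error, plain SVD:        {err_plain:.4f}")
print(f"activation error, covariance-aware: {err_cov:.4f}")
```

At equal rank, the covariance-aware factors spend capacity on the input directions that actually carry activation energy, so their output error is no larger than that of the weight-only truncation. How CARE distributes rank across layers and maps the result into the MLA format is not covered by this sketch.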
