As vision-language models (VLMs) tackle increasingly complex multimodal tasks, the rapid growth of the Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential rotary dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize the adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores the original models' performance with minimal supervised data, significantly reduces the KV cache footprint, and integrates seamlessly with KV-cache quantization.
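To make the second technique concrete, the sketch below shows one way a modality-decoupled, activation-aware low-rank factorization of a KV projection could be implemented. This is not the paper's exact algorithm: the whitened-SVD formulation, the function name, the ranks, and the calibration shapes are illustrative assumptions. It only illustrates the two ideas named in the abstract, namely compressing visual and textual KV spaces independently and minimizing output activation error rather than parameter distance.

```python
import torch

def activation_aware_lowrank(W: torch.Tensor, X: torch.Tensor, rank: int, eps: float = 1e-5):
    """Factor a projection W (out x in) into U (out x r) @ V (r x in) so that the
    output activation error ||X W^T - X V^T U^T||_F is minimized on calibration
    inputs X (n x in), using a whitened (activation-aware) truncated SVD."""
    # Whitening factor from the calibration covariance: C = X^T X = L L^T.
    C = X.T @ X + eps * torch.eye(X.shape[1], dtype=X.dtype, device=X.device)
    L = torch.linalg.cholesky(C)                       # (in, in), lower triangular
    # Best rank-r approximation of L^T W^T in the whitened space.
    M = L.T @ W.T                                      # (in, out)
    P, S, Qh = torch.linalg.svd(M, full_matrices=False)
    P_r, S_r, Qh_r = P[:, :rank], S[:rank], Qh[:rank, :]
    sqrt_S = torch.diag(S_r.sqrt())
    # Undo the whitening on the input side: V^T = L^{-T} P_r S_r^{1/2}.
    Vt = torch.linalg.solve_triangular(L.T, P_r @ sqrt_S, upper=True)  # (in, r)
    U = (sqrt_S @ Qh_r).T                              # (out, r)
    return U, Vt.T                                     # W ≈ U @ V

# Modality-decoupled compression: fit separate low-rank KV projections for visual
# and textual tokens (dimensions and ranks below are purely illustrative).
d_model, d_kv = 1024, 1024
W_kv = torch.randn(d_kv, d_model)                      # pretrained key/value projection
X_visual = torch.randn(4096, d_model)                  # calibration activations of image tokens
X_text = torch.randn(4096, d_model)                    # calibration activations of text tokens

U_vis, V_vis = activation_aware_lowrank(W_kv, X_visual, rank=128)
U_txt, V_txt = activation_aware_lowrank(W_kv, X_text, rank=64)
# At inference, only the r-dimensional latents (x @ V^T) would be cached per modality,
# shrinking that layer's KV cache roughly by a factor of d_kv / rank.
```

Under this formulation, each modality gets its own latent rank and basis, so heavily redundant image tokens can be compressed more aggressively than text tokens while both factorizations are fit against output activations rather than raw weights.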