The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to account for the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by applying Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS Recycle Bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and, in theory, guarantees better information integrity and tighter error bounds than greedy strategies, improving efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV-cache memory by 41\% with minimal accuracy loss (a 0.3\% drop) on image understanding tasks and accelerates story-generation inference by 1.5x while maintaining output quality on the Phi3.5-Vision-Instruct model.
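To make the core idea concrete, the following is a minimal, illustrative sketch of score-based KV cache eviction in the spirit of HAE's pre-filling stage. It is not the authors' algorithm: HAE's Dual-Attention Pruning additionally exploits visual-token sparsity and attention variance, whereas this sketch uses the common baseline heuristic of keeping the top-k tokens by mean received attention. The function name `evict_kv_cache` and the single-head shapes are assumptions for illustration; the 0.59 keep ratio mirrors the reported ~41\% cache reduction.

```python
# Illustrative sketch only: keep the top-k cached tokens by mean attention
# received, a simplified stand-in for HAE's pre-filling pruning.
import numpy as np

def evict_kv_cache(keys, values, attn, keep_ratio=0.59):
    """Retain the keep_ratio fraction of tokens with highest mean attention.

    keys, values: (seq_len, head_dim) arrays (one head, for simplicity)
    attn:         (seq_len, seq_len) attention weights (query x key)
    Returns pruned keys, pruned values, and the kept indices.
    """
    seq_len = keys.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    scores = attn.mean(axis=0)               # mean attention each key receives
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order
    return keys[keep], values[keep], keep

# Toy usage with random tensors standing in for a real attention pass.
rng = np.random.default_rng(0)
K = rng.normal(size=(100, 64))
V = rng.normal(size=(100, 64))
A = rng.random((100, 100))
A /= A.sum(axis=1, keepdims=True)            # row-normalize like softmax
K2, V2, idx = evict_kv_cache(K, V, A)
print(K2.shape)  # (59, 64): roughly 41% of the cache evicted
```

In a real MLLM, the kept indices would be computed once per layer and broadcast across heads (the "index broadcasting" the abstract mentions) so that gathering the pruned cache costs a single indexed copy rather than per-head bookkeeping.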