Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Multimodal large language models (MLLMs) suffer from high inference costs caused by the large number of visual tokens from the vision encoder. These redundant visual tokens impose a substantial computational load and make the key-value (KV) cache footprint a bottleneck. Existing approaches focus on token-wise optimization, using a variety of intricate token pruning techniques to eliminate non-crucial visual tokens. However, these methods often compromise the integrity of the KV cache, causing failures in long-text generation tasks. To address this, we investigate the model's attention mechanism from a new perspective and find that the attention in more than half of all decode layers is semantically similar. Based on this finding, we argue that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. We therefore propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns and thereby reduces layer-wise redundant computation in attention. Within Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Our method is also highly flexible: it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method reduces KV cache usage by over 35% and achieves a 1.5x throughput improvement while sacrificing only approximately 1% of performance on various MLLMs. Compared with state-of-the-art (SOTA) token-wise methods, our technique preserves accuracy better.
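
To make the cross-layer observation concrete, the following is a minimal NumPy sketch of one way to compare attention maps across layers and flag layers that could inherit attention from an earlier one. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the similarity measure, the layer selection rule, or what Q Cache stores, so the cosine-similarity metric, the `threshold` value, and the function names `layer_similarity` and `lazy_layer_plan` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(q, k):
    # Scaled dot-product attention weights for a single head (no mask).
    # q: (num_queries, d), k: (num_keys, d) -> (num_queries, num_keys)
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))

def layer_similarity(a, b):
    # Hypothetical metric: mean row-wise cosine similarity of two attention maps.
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float((num / den).mean())

def lazy_layer_plan(attn_maps, threshold=0.9):
    # Assumed selection rule: a layer is "lazy" (inherits attention) when its
    # map is close to the map of the most recent layer that was fully computed.
    plan = [False]            # the first layer is always computed
    ref = attn_maps[0]
    for a in attn_maps[1:]:
        if layer_similarity(ref, a) >= threshold:
            plan.append(True)
        else:
            plan.append(False)
            ref = a           # this layer becomes the new reference
    return plan
```

Given per-layer attention maps collected from a calibration run, `lazy_layer_plan(maps)` returns one boolean per layer; the fraction of `True` entries gives a rough sense of how many layers might skip recomputing attention under this assumed criterion.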
