Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation-window attention for stable token-importance estimation, yet aggregating attention over the window can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. We identify last-query attention as a complementary source for recovering such evidence, although its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation-window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%. Our project page is available at https://ryu1ion.github.io/official_BACON/
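To make the dilution effect concrete, the following is a minimal sketch on a toy attention matrix, not the BACON algorithm itself. It assumes a common form of observation-window scoring (the mean attention each cached token receives from the last few queries) and compares it with the attention of the final query alone. The function names, the toy values, and the blending weight `alpha` are all hypothetical; in particular, the linear blend is only a placeholder for calibration, and the intra-layer coherence and inter-layer persistence filters are not modeled.

```python
import numpy as np

def window_scores(attn, window):
    """Mean attention each cached token receives from the last `window` queries."""
    return attn[-window:].mean(axis=0)

def last_query_scores(attn):
    """Attention each cached token receives from the final query only."""
    return attn[-1]

def top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in ascending order."""
    return np.sort(np.argsort(scores)[-k:])

# Toy attention: 4 observation queries (rows) over 6 cached tokens (columns).
# Each row sums to 1. Only the last query attends strongly to token 5.
attn = np.array([
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],
    [0.12, 0.08, 0.05, 0.05, 0.10, 0.60],
])

budget = 2    # number of tokens to keep (hypothetical aggressive budget)
alpha = 0.5   # hypothetical blending weight, not taken from the paper

w = window_scores(attn, window=4)
q = last_query_scores(attn)

kept_window = top_k(w, budget)                            # window-only retention
kept_blend = top_k((1 - alpha) * w + alpha * q, budget)   # placeholder calibration

print("window-only keeps:", kept_window.tolist())
print("blended keeps:    ", kept_blend.tolist())
```

In this toy case the window average spreads token 5's single strong signal across four queries, so it falls outside the two-token budget, while adding last-query evidence brings it back. The same last-query row could just as easily highlight an answer-irrelevant token, which is the noise that the coherence and persistence checks described above are meant to suppress.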