While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency. Current efforts to mitigate this via token compression often fail because they blindly apply text-centric importance metrics to multimodal contexts. We identify a critical failure mode, termed Visual Amnesia, in which linguistically redundant yet visually grounded tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip, which reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on the Qwen2-VL and Llama-3.2 model families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. In particular, it preserves fine-grained visual details, outperforming competing baselines by over 30\% on DocVQA.
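The dual-path gating idea can be illustrated with a minimal sketch. The function below is a hypothetical simplification (the function name, the linear mixing weight `alpha`, and the top-$k$ selection are all assumptions, not the paper's actual formulation): it scores each CoT token by combining its linguistic surprisal with its cross-modal attention mass to the image, so that a token with low surprisal but strong visual grounding can still survive pruning.

```python
import numpy as np

def vskip_keep_mask(logprobs, cross_attn, alpha=0.5, keep_ratio=0.5):
    """Hypothetical sketch of dual-path gating.

    logprobs   : per-token log-probabilities under the LM (linguistic path)
    cross_attn : per-token attention mass onto visual tokens (visual path)
    Returns a boolean mask of tokens to keep.
    """
    surprisal = -np.asarray(logprobs, dtype=float)  # high = linguistically informative
    visual = np.asarray(cross_attn, dtype=float)    # high = strongly visually grounded

    def minmax(x):
        # normalize each path to [0, 1] so the two scores are comparable
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    score = alpha * minmax(surprisal) + (1 - alpha) * minmax(visual)
    k = max(1, int(np.ceil(keep_ratio * len(score))))
    keep = np.zeros(len(score), dtype=bool)
    keep[np.argsort(score)[-k:]] = True  # retain the top-k scoring tokens
    return keep
```

In this toy setting, a token whose surprisal is near zero (a text-only metric would prune it) is rescued whenever its cross-modal attention is high, which is exactly the Visual Amnesia failure the paper targets.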