Multimodal large language models (MLLMs) demand considerable computation for inference due to their extensive parameters and the additional input tokens needed to represent visual information. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module that accelerates MLLM inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention-sink phenomenon, prevalent in LLMs, also persists in MLLMs: in deep layers, the initial tokens and the most recent tokens receive the majority of attention, while the intermediate vision tokens garner minimal attention; (2) information migration occurs, meaning visual information is transferred to subsequent text tokens within the first few layers of an MLLM. Based on these findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. We therefore strategically withdraw them at a certain layer, so that only text tokens engage in subsequent layers. To pinpoint the ideal layer for VTW, we analyze a small number of tiny datasets and choose the first layer that satisfies a Kullback-Leibler divergence criterion. Our VTW approach cuts computational overhead by over 40\% across diverse multimodal tasks while maintaining performance. Our code is released at \url{https://github.com/lzhxmu/VTW}.
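The withdrawal step described above can be sketched as follows. This is a minimal illustration, not the released VTW implementation: the "layer" is a trivial per-token map, and all function names, the token layout (prefix, vision, text), and dimensions are assumptions made for the example.

```python
# Minimal sketch of Visual Tokens Withdrawal (VTW); all names and the
# toy "layer" below are illustrative, not the released implementation.

def toy_layer(hidden):
    # Stand-in for a transformer decoder layer: a trivial per-token map.
    return [[0.9 * v + 0.1 for v in token] for token in hidden]

def forward_with_vtw(num_layers, hidden, vision_start, num_vision, withdraw_layer):
    """Apply `num_layers` toy layers; once `withdraw_layer` layers have
    run, drop the vision-token positions so deeper layers process only
    text tokens (hypothetical interface)."""
    for i in range(num_layers):
        if i == withdraw_layer:
            # Withdraw vision tokens: keep prefix and trailing text tokens.
            hidden = hidden[:vision_start] + hidden[vision_start + num_vision:]
        hidden = toy_layer(hidden)
    return hidden

# Toy usage: 5 prefix + 576 vision + 20 text tokens, hidden dim 8,
# 4 layers, vision tokens withdrawn before layer index 2.
seq = [[0.0] * 8 for _ in range(5 + 576 + 20)]
out = forward_with_vtw(4, seq, vision_start=5, num_vision=576, withdraw_layer=2)
print(len(out))  # 25 tokens remain after withdrawal
```

Because deep-layer attention scales with sequence length, removing several hundred vision tokens from the later layers is where the claimed compute savings would come from.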
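The layer-selection criterion can likewise be sketched: for each candidate layer, compare the model's output distribution with vision tokens withdrawn at that layer against the full model's output on a small calibration set, and take the first layer whose KL divergence falls below a threshold. The interface and threshold below are hypothetical; the paper's exact criterion may differ.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete probability distributions.
    return sum(pi * math.log(max(pi, eps) / max(qi, eps)) for pi, qi in zip(p, q))

def select_withdrawal_layer(full_probs, vtw_probs_per_layer, threshold):
    """Return the first layer index k whose withdrawn-output distribution
    stays within `threshold` KL divergence of the full model's output.
    Hypothetical interface: `vtw_probs_per_layer[k]` holds the output
    probabilities when vision tokens are withdrawn at layer k."""
    for k, q in enumerate(vtw_probs_per_layer):
        if kl_divergence(full_probs, q) <= threshold:
            return k
    return len(vtw_probs_per_layer)  # fallback: never withdraw

# Toy usage: synthetic per-layer outputs that approach the full output.
full = [0.7, 0.2, 0.1]
per_layer = [[0.2, 0.5, 0.3], [0.5, 0.3, 0.2], [0.69, 0.21, 0.10]]
print(select_withdrawal_layer(full, per_layer, threshold=0.01))  # 2
```

Choosing the *first* qualifying layer maximizes savings: every layer after it skips the vision tokens, while the output distribution stays close to the full model's.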