Multimodal large language models (MLLMs) demand considerable computation for inference due to their extensive parameters and the additional input tokens needed to represent visual information. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module that accelerates MLLM inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon, prevalent in LLMs, also persists in MLLMs, where initial tokens and the most recent tokens receive the majority of attention, while vision tokens in the middle of the sequence garner minimal attention in deep layers; (2) information migration, whereby visual information is transferred to the subsequent text tokens within the first few layers of an MLLM. Based on these findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we strategically withdraw them at a certain layer, so that only text tokens engage in subsequent layers. To pinpoint the ideal layer for visual token withdrawal, we first analyze a small set of tiny datasets and choose the first layer that meets a Kullback-Leibler divergence criterion. Our VTW approach cuts computational overhead by over 40\% across diverse multimodal tasks while maintaining performance. Our code is released at https://github.com/lzhxmu/VTW.
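To make the mechanism concrete, below is a minimal PyTorch sketch of the two ideas described above: withdrawing the vision-token span at a chosen layer so only text tokens proceed, and a Kullback-Leibler divergence check between the full model's and the truncated model's next-token distributions for screening candidate layers. All function and argument names here are illustrative, not the released API; the actual implementation at https://github.com/lzhxmu/VTW organizes these steps differently.

```python
import torch
import torch.nn.functional as F


def forward_with_vtw(layers, hidden, vis_start, vis_len, withdraw_layer):
    """Run a decoder stack, withdrawing vision tokens at `withdraw_layer`.

    `hidden` is [batch, seq, dim]; vision tokens are assumed to occupy
    hidden[:, vis_start : vis_start + vis_len]. (Illustrative sketch only.)
    """
    for i, layer in enumerate(layers):
        if i == withdraw_layer:
            # Withdraw vision tokens: only text tokens attend in deep layers.
            hidden = torch.cat(
                (hidden[:, :vis_start], hidden[:, vis_start + vis_len :]),
                dim=1,
            )
        hidden = layer(hidden)
    return hidden


def kl_to_full_model(full_logits, vtw_logits):
    """KL divergence of the VTW model's next-token distribution from the
    full model's, averaged over the batch. A candidate withdrawal layer
    would be accepted once this falls below a small threshold."""
    p = F.log_softmax(full_logits, dim=-1)  # full model (reference)
    q = F.log_softmax(vtw_logits, dim=-1)   # vision tokens withdrawn
    return F.kl_div(q, p, log_target=True, reduction="batchmean")
```

Under this reading, the layer search would sweep `withdraw_layer` from shallow to deep over a few calibration samples and pick the first layer whose averaged divergence satisfies the criterion, trading a one-time calibration pass for per-query savings at inference.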