MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content under language instructions by converting visual inputs into vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them use only unimodal information (i.e., vision or text) for pruning and ignore the inherent multimodal nature of vision-language tasks. Moreover, a generic criterion that can be applied to different modalities is lacking. To mitigate these limitations, we propose to leverage both vision and text tokens to select informative vision tokens under a coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. A subset of vision tokens is then optimized to cover the text tokens and the original set of vision tokens simultaneously. The proposed method, MMTok, is extensively evaluated on benchmark datasets with different VLMs. The comparison shows that vision and text information are complementary, and that combining multimodal information can surpass the unimodal baseline by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Finally, with only four vision tokens, 87.7% of the original performance is still preserved on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection. The code is available at https://github.com/Ironieser/mmtok
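
To make the coverage idea concrete, the following is a minimal sketch of greedy subset selection under a coverage objective. It is not the released MMTok code, and every detail here is an assumption for illustration: the function names (`cosine`, `greedy_coverage_select`), the use of clamped cosine similarity, the facility-location form of coverage (each target token is credited with its best similarity to any selected vision token), and the mixing weight `alpha` between text coverage and vision coverage. Plain Python lists are used to keep the sketch dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_coverage_select(vision, text, k, alpha=0.5):
    """Greedily pick up to k vision-token indices that cover text and vision tokens.

    Hypothetical objective (an assumption, not taken from the paper):
      alpha * mean text coverage + (1 - alpha) * mean vision coverage,
    where a target's coverage is its best clamped cosine similarity
    to any selected vision token.
    """
    n = len(vision)
    # sim_*[t][j]: similarity between target t and candidate vision token j,
    # clamped at zero so the objective never decreases when a token is added.
    sim_text = [[max(0.0, cosine(t, v)) for v in vision] for t in text]
    sim_vis = [[max(0.0, cosine(u, v)) for v in vision] for u in vision]
    # Best similarity each target has to the tokens selected so far.
    best_text = [0.0] * len(text)
    best_vis = [0.0] * n

    def marginal_gain(j):
        # Increase in the objective if candidate j were added to the selection.
        gain_text = sum(max(row[j] - b, 0.0) for row, b in zip(sim_text, best_text))
        gain_vis = sum(max(row[j] - b, 0.0) for row, b in zip(sim_vis, best_vis))
        return (alpha * gain_text / max(len(text), 1)
                + (1.0 - alpha) * gain_vis / max(n, 1))

    selected = []
    remaining = list(range(n))
    for _ in range(min(k, n)):
        j = max(remaining, key=marginal_gain)  # ties go to the lowest index
        selected.append(j)
        remaining.remove(j)
        best_text = [max(b, row[j]) for row, b in zip(sim_text, best_text)]
        best_vis = [max(b, row[j]) for row, b in zip(sim_vis, best_vis)]
    return selected


if __name__ == "__main__":
    # Toy data: two near-duplicate pairs of vision tokens and one text token
    # aligned with the second pair.
    vision = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    text = [[0.0, 1.0]]
    print(greedy_coverage_select(vision, text, k=2))
```

In the toy data, the first pick comes from the pair aligned with the text token, and the second pick comes from the other pair, because a near-duplicate of an already selected token adds almost no coverage. Greedy selection is a common choice for objectives of this kind; whether MMTok uses this exact objective or optimizer is not stated in the abstract.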
