Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration that integrates two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
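The two-stage pipeline described above can be illustrated with a minimal NumPy sketch. This is a generic, hypothetical rendering of the idea, not the paper's actual implementation: the function names, the use of [CLS]-attention as the global importance score, and the softmax-weighted merging into a single complement token are all illustrative assumptions.

```python
import numpy as np

def select_dominant_tokens(vision_tokens, importance, keep_ratio=0.25):
    """Stage 1 (DVTS-like): keep the top-k visual tokens ranked by a
    global importance score (e.g., [CLS] attention -- an assumption here)."""
    k = max(1, int(len(vision_tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(importance)[::-1][:k])  # top-k, original order
    return vision_tokens[keep_idx], keep_idx

def text_guided_complement(vision_tokens, text_tokens, keep_idx):
    """Stage 2 (TGVC-like): merge the pruned tokens into one complement
    token, weighted by their similarity to the mean text embedding."""
    mask = np.ones(len(vision_tokens), dtype=bool)
    mask[keep_idx] = False
    pruned = vision_tokens[mask]
    if len(pruned) == 0:
        return vision_tokens[keep_idx]
    text_query = text_tokens.mean(axis=0)
    # cosine similarity between each pruned token and the text query
    sim = pruned @ text_query / (
        np.linalg.norm(pruned, axis=1) * np.linalg.norm(text_query) + 1e-8
    )
    weights = np.exp(sim) / np.exp(sim).sum()  # softmax over pruned tokens
    complement = (weights[:, None] * pruned).sum(axis=0, keepdims=True)
    return np.concatenate([vision_tokens[keep_idx], complement], axis=0)
```

For example, pruning 16 visual tokens at a 25% keep ratio yields 4 dominant tokens plus 1 text-guided complement token, so the language model sees 5 visual tokens instead of 16, without any retraining.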