Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. To address this, we introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance. Unlike previous methods that rely heavily on attention mechanisms and overlook cross-modal interactions, we use a prompt-aware strategy to adaptively identify and cluster essential visual tokens. PAR categorizes visual context redundancy into two types: external and internal. External redundancy is minimized through semantic retrieval, while internal redundancy is addressed using a token routing mechanism. This method substantially reduces computational load without requiring additional training or complex architectural modifications. \textbf{Experimental results demonstrate that across various visual question answering tasks, PAR reduces FLOPs by 83\% with a compression ratio of 89\%, while retaining 97\% of baseline accuracy.} The adaptive design of PAR achieves a 2x token reduction ratio compared to prior approaches, enabling a better balance between performance and efficiency.
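The two-stage pipeline described above, semantic retrieval against the prompt followed by token routing, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the cosine-similarity scoring for the retrieval stage, and the k-means-style merging standing in for the routing mechanism are all assumptions.

```python
import numpy as np

def prompt_aware_reduce(visual_tokens, prompt_embedding, keep_ratio=0.11, n_clusters=8):
    """Hypothetical sketch of prompt-aware token reduction.

    visual_tokens: (N, d) array of visual token embeddings.
    prompt_embedding: (d,) embedding of the text prompt.
    keep_ratio: fraction of tokens retained by semantic retrieval
                (0.11 here mirrors the 89% compression ratio).
    """
    # Stage 1 — external redundancy: semantic retrieval. Score each
    # visual token by cosine similarity to the prompt embedding and
    # keep only the top-k most prompt-relevant tokens.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_embedding / np.linalg.norm(prompt_embedding)
    scores = v @ p
    k = max(1, int(len(visual_tokens) * keep_ratio))
    kept = visual_tokens[np.argsort(scores)[::-1][:k]]

    # Stage 2 — internal redundancy: route near-duplicate kept tokens
    # to shared centroids (a simple k-means stand-in for the paper's
    # token routing mechanism).
    n_clusters = min(n_clusters, len(kept))
    centroids = kept[:n_clusters].copy()
    for _ in range(10):  # a few Lloyd iterations
        dists = np.linalg.norm(kept[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = kept[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids  # reduced set of visual tokens fed to the LLM
```

Under these assumptions, 576 visual tokens (a typical CLIP-style patch count) would be retrieved down to 63 and then merged into 8 representative tokens, illustrating how a high compression ratio can be reached without retraining.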