Large Multimodal Models (LMMs) have proven effective on a wide range of tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and processed jointly by the language model. However, the growing number of visual tokens substantially increases inference cost. Visual token pruning has emerged as a promising remedy, yet existing methods largely overlook long-context inputs containing multiple images. In this paper, we analyze the challenges of visual token pruning in long-context, multi-image settings and introduce an adaptive pruning method tailored to such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity against text alignment. Extensive experiments show that our approach prunes up to 80% of visual tokens while maintaining performance in long-context settings.
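To make the two-stage procedure concrete, the sketch below shows one plausible instantiation under stated assumptions: visual tokens are per-image embedding matrices, text is a single prompt embedding, intra-image diversity is proxied by mean distance to the centroid, greedy selection is farthest-point sampling, and the Pareto step is approximated by a scalarized diversity/alignment trade-off with equal weights. All function names and these specific choices are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of two-stage visual token pruning (illustrative, not the
# paper's method). Assumes tokens and the text prompt share an embedding space.
import numpy as np

def intra_image_diversity(tokens: np.ndarray) -> float:
    """Assumed diversity proxy: mean distance of tokens to their centroid."""
    centroid = tokens.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(tokens - centroid, axis=1).mean())

def allocate_budgets(images: list[np.ndarray], total_budget: int) -> list[int]:
    """Content-aware budgets: more diverse images keep more tokens."""
    scores = np.array([intra_image_diversity(t) for t in images])
    weights = scores / scores.sum()
    # Rounding may slightly over/undershoot total_budget; fine for a sketch.
    return np.maximum(1, np.round(weights * total_budget)).astype(int).tolist()

def farthest_point_select(tokens: np.ndarray, k: int) -> np.ndarray:
    """Greedily pick k mutually distant (representative) tokens."""
    chosen = [int(np.linalg.norm(tokens - tokens.mean(0), axis=1).argmax())]
    dists = np.linalg.norm(tokens - tokens[chosen[0]], axis=1)
    while len(chosen) < min(k, len(tokens)):
        nxt = int(dists.argmax())
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(tokens - tokens[nxt], axis=1))
    return tokens[chosen]

def pareto_select(pool: np.ndarray, text_emb: np.ndarray, k: int) -> np.ndarray:
    """Trade off diversity (distance to kept tokens) against text alignment
    (cosine similarity to the prompt). Equal weights are an arbitrary choice."""
    align = pool @ text_emb / (
        np.linalg.norm(pool, axis=1) * np.linalg.norm(text_emb) + 1e-8)
    kept = [int(align.argmax())]  # seed with the best-aligned token
    div = np.linalg.norm(pool - pool[kept[0]], axis=1)
    while len(kept) < min(k, len(pool)):
        score = 0.5 * div / (div.max() + 1e-8) + 0.5 * align
        score[kept] = -np.inf  # never re-pick an already-kept token
        nxt = int(score.argmax())
        kept.append(nxt)
        div = np.minimum(div, np.linalg.norm(pool - pool[nxt], axis=1))
    return pool[kept]

# Toy usage: three images of 196 tokens each, pruned to ~20% overall.
rng = np.random.default_rng(0)
images = [rng.normal(size=(196, 64)) for _ in range(3)]
text_emb = rng.normal(size=64)
total = sum(len(t) for t in images)
budgets = allocate_budgets(images, total_budget=int(0.3 * total))   # stage 1
pool = np.vstack([farthest_point_select(t, b)
                  for t, b in zip(images, budgets)])                 # candidate pool
kept = pareto_select(pool, text_emb, k=int(0.2 * total))             # stage 2
print(f"{total} visual tokens -> {len(kept)} after two-stage pruning")
```

The scalarized score is only one way to approximate a Pareto selection; the actual procedure and the diversity/alignment measures in the paper may differ.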