ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task

Large Multimodal Models (LMMs) are powerful tools capable of reasoning about and understanding multimodal information beyond text and language. Despite their established impact, the development of LMMs is hindered by computational requirements that are higher than those of their unimodal counterparts. One of the main causes is the large number of tokens needed to encode the visual input, which is especially evident in multi-image multimodal tasks. Recent approaches to reducing visual tokens depend on the visual encoder architecture, require fine-tuning the LLM to maintain performance, and consider only single-image scenarios. To address these limitations, we propose ToFu, a visual encoder-agnostic, training-free Token Fusion strategy that combines redundant visual tokens of LMMs for high-resolution, multi-image tasks. The core intuition behind our method is straightforward yet effective: preserve distinctive tokens while combining similar ones. We achieve this by sequentially examining visual tokens and deciding whether to merge them with others or keep them as separate entities. We validate our approach on the well-established LLaVA-Interleave Bench, which covers challenging multi-image tasks. In addition, we push our method to the extreme by testing it on a newly created benchmark, ComPairs, which focuses on multi-image comparisons where a larger number of images and visual tokens are fed to the LMMs. Our extensive analysis, considering several LMM architectures, demonstrates the benefits of our approach in terms of both efficiency and performance.
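
The sequential keep-or-merge idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the cosine-similarity criterion, the `threshold` value, the running-mean merge rule, and the function names `cosine` and `fuse_tokens` are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse_tokens(tokens, threshold=0.9):
    """Sequentially fuse similar token embeddings.

    `threshold` and the running-mean merge rule are illustrative
    assumptions, not values or rules taken from the paper.
    """
    kept = []    # representative embedding of each group
    counts = []  # number of tokens fused into each group
    for tok in tokens:
        # Find the most similar representative kept so far.
        best, best_sim = -1, -1.0
        for i, rep in enumerate(kept):
            sim = cosine(tok, rep)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            # Redundant token: fold it into the running mean of its group.
            n = counts[best]
            kept[best] = [(r * n + t) / (n + 1) for r, t in zip(kept[best], tok)]
            counts[best] = n + 1
        else:
            # Distinctive token: keep it as a separate entity.
            kept.append(list(tok))
            counts.append(1)
    return kept, counts


if __name__ == "__main__":
    # Four toy 2-D "visual tokens": two near-duplicates along each axis.
    toy = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 2.0]]
    fused, sizes = fuse_tokens(toy, threshold=0.9)
    print(len(toy), "tokens reduced to", len(fused), "with group sizes", sizes)
```

In this toy setting the sequence length passed to the language model shrinks whenever consecutive or distant tokens point in nearly the same direction, while dissimilar tokens survive unchanged. Because the sketch only inspects the embeddings themselves, it makes no assumption about which visual encoder produced them and involves no training.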
