Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with them. Existing methods for enhancing the visual perception of MLLMs typically design more powerful vision encoders, which requires exploring a vast design space and re-aligning each candidate encoder with the language model, incurring prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently exploits multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family (a group of MLLMs sharing the same pretrained LLM) are highly aligned. Building on these insights, VisionFuse enriches the visual context by concatenating the tokens generated by the vision encoders of the selected MLLMs within a family. By merging the parameters of the language models of these MLLMs, VisionFuse allows a single language model to remain aligned with all of the vision encoders, significantly reducing deployment overhead. We conduct comprehensive evaluations across multiple multimodal benchmarks using various MLLM combinations, demonstrating substantial improvements on multimodal tasks. Notably, when integrating MiniGemini-8B and SLIME-8B, VisionFuse achieves an average performance increase of over 4%.
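To make the two core operations concrete, the sketch below illustrates, in minimal PyTorch, (i) concatenating the projected visual tokens from multiple vision encoders and (ii) merging the language-model parameters of family members by weighted averaging. This is a simplified illustration under stated assumptions, not the authors' released implementation: the function names (`fuse_visual_tokens`, `merge_llm_weights`) and the use of a plain weighted average for merging are hypothetical choices for exposition.

```python
import torch

def fuse_visual_tokens(encoders, projectors, image):
    """Enrich the visual context by concatenating the token sequences
    produced by each MLLM's vision encoder (after its own projector).

    Assumes each projector maps its encoder's features into the shared
    LLM embedding space, so the last dimension matches across encoders.
    """
    token_seqs = []
    for enc, proj in zip(encoders, projectors):
        feats = enc(image)              # (batch, n_tokens_i, enc_dim_i)
        token_seqs.append(proj(feats))  # (batch, n_tokens_i, llm_dim)
    # Concatenate along the sequence dimension: the language model now
    # attends to visual tokens from every encoder at once.
    return torch.cat(token_seqs, dim=1)

def merge_llm_weights(state_dicts, weights=None):
    """Merge the language-model parameters of MLLMs from one family so a
    single LLM stays aligned with all of their vision encoders.

    A uniform weighted average is one simple merging scheme; the paper's
    exact merging strategy may differ (hypothetical sketch).
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float()
                          for w, sd in zip(weights, state_dicts))
    return merged
```

Concatenation along the token dimension works without retraining precisely because, as the abstract notes, the feature distributions of vision encoders within a family are highly aligned, and the merged language model provides a single backbone compatible with all of them.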