Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, challenges remain in accurately interpreting detailed visual elements. Vision detection models excel at recognizing fine-grained image details, prompting researchers to use them to enhance MLLMs. One effective strategy is to infuse detection information in text format, which has proven simple and effective. However, most studies apply this method without training, leaving the potential of adaptive training largely unexplored. Adaptive training could significantly enhance MLLMs' comprehension of unique inputs while filtering out irrelevant information. This paper addresses the crucial question: How does training impact MLLMs' understanding of infused textual detection information? We systematically experiment with various representative models to evaluate the effects of training-free, retraining, and fine-tuning strategies. We also examine the influence of training on MLLMs' original abilities and the interchangeability of detection models. Our findings indicate that fine-tuning a pre-trained MLLM to incorporate textual detection information delivers superior results compared to training-free and retraining methods, improving average performance by 6.71% across 10 widely recognized benchmarks. Furthermore, fine-tuning enables MLLMs to retain performance enhancements even when detection models are swapped, indicating an improved understanding of the formatted textual data. We release our code to support further exploration of fusion strategies for vision detection models and the enhancement of MLLMs' fine-grained multimodal capabilities.
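The core mechanism discussed above — serializing vision-detection output into plain text and injecting it into the MLLM's prompt — can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: the detection fields (`label`, `score`, `box`), the text template, and the helper names `detections_to_text` and `build_prompt` are all assumptions for exposition.

```python
# Hypothetical sketch of textual detection infusion: detection results are
# rendered as a plain-text block and combined with the user's question to
# form the MLLM prompt. The exact format used in the paper may differ.

def detections_to_text(detections):
    """Render a list of detection dicts as a human-readable text block."""
    lines = []
    for d in detections:
        x1, y1, x2, y2 = d["box"]  # assumed (x1, y1, x2, y2) pixel coordinates
        lines.append(
            f"{d['label']} (score {d['score']:.2f}) at [{x1}, {y1}, {x2}, {y2}]"
        )
    return "Detected objects:\n" + "\n".join(lines)

def build_prompt(question, detections):
    """Prepend the serialized detections to the question for the MLLM."""
    return detections_to_text(detections) + "\n\nQuestion: " + question

# Example usage with made-up detections:
dets = [
    {"label": "dog", "score": 0.97, "box": (12, 40, 210, 300)},
    {"label": "frisbee", "score": 0.88, "box": (180, 25, 260, 90)},
]
print(build_prompt("What is the dog chasing?", dets))
```

Because the detection information enters purely as text, the same template works whichever detection model produced the boxes — which is why, as the abstract notes, a fine-tuned MLLM can retain its gains when the detection model is swapped.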