Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, challenges remain in accurately interpreting detailed visual elements. Vision detection models excel at recognizing fine-grained image details, prompting researchers to use them to enhance MLLMs. One effective strategy is to infuse detection information in text format, which has proven simple and effective. However, most studies apply this method without training, leaving the potential of adaptive training largely unexplored. Adaptive training could significantly enhance MLLMs' comprehension of unique inputs while filtering out irrelevant information. This paper addresses the crucial question: How does training impact MLLMs' understanding of infused textual detection information? We systematically experiment with various representative models to evaluate the effects of training-free, retraining, and fine-tuning strategies. We also examine the influence of training on MLLMs' original abilities and the interchangeability of detection models. Our findings indicate that fine-tuning a pre-trained MLLM to incorporate textual detection information delivers superior results compared to training-free and retraining methods, improving average performance by 6.71% across 10 widely recognized benchmarks. Furthermore, fine-tuning enables MLLMs to retain performance enhancements even when detection models are swapped, indicating an improved understanding of the formatted textual data. We release our code to support further exploration of fusion strategies for vision detection models and the enhancement of MLLMs' fine-grained multimodal capabilities.
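The core mechanism discussed above — serializing vision-detection output into plain text and injecting it into the MLLM's prompt — can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: the detection fields (`label`, `score`, `box`), the text template, and the helper names `detections_to_text` and `build_prompt` are all assumptions for exposition.

```python
# Hypothetical sketch of textual detection infusion: detection results are
# rendered as a plain-text block and combined with the user's question to
# form the MLLM prompt. The exact format used in the paper may differ.

def detections_to_text(detections):
    """Render a list of detection dicts as a human-readable text block."""
    lines = []
    for d in detections:
        x1, y1, x2, y2 = d["box"]  # assumed (x1, y1, x2, y2) pixel coordinates
        lines.append(
            f"{d['label']} (score {d['score']:.2f}) at [{x1}, {y1}, {x2}, {y2}]"
        )
    return "Detected objects:\n" + "\n".join(lines)

def build_prompt(question, detections):
    """Prepend the serialized detections to the question for the MLLM."""
    return detections_to_text(detections) + "\n\nQuestion: " + question

# Example usage with made-up detections:
dets = [
    {"label": "dog", "score": 0.97, "box": (12, 40, 210, 300)},
    {"label": "frisbee", "score": 0.88, "box": (180, 25, 260, 90)},
]
print(build_prompt("What is the dog chasing?", dets))
```

Because the detection information enters purely as text, the same template works whichever detection model produced the boxes — which is why, as the abstract notes, a fine-tuned MLLM can retain its gains when the detection model is swapped.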