Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, they still struggle to interpret fine-grained visual details accurately. This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and optical character recognition (OCR) models to improve fine-grained image understanding and reduce hallucination in responses. We investigate three questions: how to infuse detection information through embeddings, how such infusion affects the MLLMs' original abilities, and whether the detection models are interchangeable. We conduct systematic experiments with models such as LLaVA-1.5, DINO, and PaddleOCRv2, and find that our approach improves MLLMs' performance on specific visual tasks while preserving their original strengths. The resulting enhanced MLLMs outperform SOTA models on 9 out of 10 benchmarks, with an improvement of up to 12.99% on the normalized average score. We release our code to facilitate further exploration of the fine-grained multimodal dialogue capabilities of MLLMs.