Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a range of multi-modal tasks, yet their performance on fine-grained image understanding tasks remains limited. To address this issue, this paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. Specifically, we present a method for constructing the instruction tuning dataset at low cost by leveraging annotations in existing datasets. We also introduce a self-consistent bootstrapping method that extends existing dense object annotations into high-quality referring-expression-bounding-box pairs. Together, these methods enable the generation of high-quality instruction data covering a wide range of fundamental abilities essential for fine-grained image perception. Moreover, we argue that the visual encoder should be tuned during instruction tuning to mitigate the gap between full image perception and fine-grained image perception. Experimental results demonstrate the superior performance of our method. For instance, our model achieves a 5.2% accuracy improvement over Qwen-VL on GQA and surpasses Kosmos-2 by 24.7% in accuracy on RefCOCO_val. We have also attained the top rank on the MMBench leaderboard. This performance is achieved by training on only publicly available data, making it easily reproducible. The models, datasets, and code are publicly available at https://github.com/SY-Xuan/Pink.
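The abstract does not spell out how the self-consistent bootstrapping step works, so the following is only a minimal sketch of one plausible reading: a description is generated for each annotated box, the description is grounded back to a box, and the pair is kept only if the two boxes agree. The callables `describe` and `ground`, the function name `bootstrap_pairs`, and the IoU threshold of 0.5 are all hypothetical placeholders introduced for illustration, not the authors' implementation.

```python
from typing import Callable, Iterable, List, Tuple

# Boxes are (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bootstrap_pairs(
    boxes: Iterable[Box],
    describe: Callable[[Box], str],   # hypothetical: box -> referring expression
    ground: Callable[[str], Box],     # hypothetical: referring expression -> box
    threshold: float = 0.5,           # assumed value, not taken from the paper
) -> List[Tuple[str, Box]]:
    """Keep (expression, box) pairs whose grounded box matches the annotation."""
    kept = []
    for box in boxes:
        expression = describe(box)
        predicted = ground(expression)
        # Self-consistency check: the expression must lead back to the same region.
        if iou(box, predicted) >= threshold:
            kept.append((expression, box))
    return kept
```

In practice, `describe` and `ground` would both be calls into the model being bootstrapped; the filter simply discards descriptions that the model itself cannot localize back to the original annotation.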