Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they remain constrained to single-target localization and a limited range of practical tasks, owing to the lack of a unified formulation for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets in terms of target quantity and image relations. To robustly handle diverse multi-image grounding tasks, we further propose a hybrid reinforcement fine-tuning strategy that combines chain-of-thought (CoT) reasoning with direct answering, exploiting their complementary strengths. The strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previously leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. For single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
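The abstract does not specify the form of the "carefully designed rule-based reward". The snippet below is a minimal illustrative sketch of what such a reward could look like for multi-target, multi-image grounding under an R1-like setup: a format term that checks whether a rollout follows the expected CoT or direct-answer template, plus an accuracy term based on per-target IoU matching. The tag names, the "image N: [x1, y1, x2, y2]" box convention, the 0.5 IoU threshold, and the reward weights are assumptions for illustration, not the paper's specification.

```python
# Illustrative rule-based reward sketch (assumptions: <think>/<answer> tags,
# "image N: [x1, y1, x2, y2]" box format, 0.5 IoU threshold, 0.5 format weight).
import re
from typing import List, Tuple

Box = Tuple[int, List[float]]  # (image index, [x1, y1, x2, y2])


def iou(a: List[float], b: List[float]) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def parse_answer(text: str) -> List[Box]:
    """Extract (image_idx, box) pairs from an <answer>...</answer> span."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.S)
    if m is None:
        return []
    pattern = r"image\s*(\d+)\s*:\s*\[([\d.,\s]+)\]"
    boxes = []
    for idx, coords in re.findall(pattern, m.group(1)):
        vals = [float(v) for v in coords.split(",")]
        if len(vals) == 4:
            boxes.append((int(idx), vals))
    return boxes


def rule_based_reward(completion: str, gt_boxes: List[Box],
                      require_cot: bool) -> float:
    """Format reward + accuracy reward, returned as a single scalar."""
    # Format term: CoT rollouts must contain a <think> span, direct-answer
    # rollouts must not; both must contain an <answer> span.
    has_think = "<think>" in completion and "</think>" in completion
    format_ok = (has_think == require_cot) and "<answer>" in completion
    format_reward = 0.5 if format_ok else 0.0

    # Accuracy term: fraction of ground-truth targets matched on the correct
    # image with IoU > 0.5, so multi-target cases are scored per target.
    preds = parse_answer(completion)
    if not gt_boxes:
        return format_reward
    hits = 0
    for g_idx, g_box in gt_boxes:
        if any(p_idx == g_idx and iou(p_box, g_box) > 0.5
               for p_idx, p_box in preds):
            hits += 1
    return format_reward + hits / len(gt_boxes)
```

A scalar reward of this shape can be plugged into a group-relative policy optimization loop (as in R1-style training) by scoring each sampled rollout against the ground-truth boxes and normalizing rewards within the group; the specific box format and weighting here are placeholders.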