Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit $\textbf{modality bias}$, comprising both visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than perform rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning ($\textbf{MCR}$), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO enables the model to dynamically align its reasoning trajectories via Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.