The Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, their entity types, and their corresponding visual regions. The task exhibits two challenging attributes: 1) the tenuous correlation between images and text on social media leaves a notable proportion of named entities ungroundable; 2) there is a distinction between the coarse-grained noun phrases used in related tasks (e.g., phrase localization) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) it allows the MNER module to be optimized for the best MNER performance and eliminates the need to pre-extract region features with object detection methods, naturally addressing the two major limitations of existing GMNER methods; 2) the introduction of the Entity Expansion Expression module and the Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG), endowing the framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity stemming from the coarse-grained bounding-box output in GMNER, we construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and the corresponding Twitter-SMNER dataset, aimed at generating fine-grained segmentation masks, and experimentally demonstrate the feasibility and effectiveness of using the box prompt-based Segment Anything Model (SAM) to empower any GMNER model to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks.