Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, their entity types, and their corresponding visual regions. The GMNER task has two challenging properties: 1) The weak correlation between the image and the text of a social media post leaves a significant portion of named entities ungroundable. 2) Named entities are fine-grained, whereas the referring expressions commonly used in similar tasks (e.g., phrase localization, referring expression comprehension) are coarse-grained. In this paper, we propose RiVEG, a unified framework that reformulates GMNER as a joint MNER-VE-VG task by using large language models (LLMs) as a connecting bridge. This reformulation brings two benefits: 1) It maintains optimal MNER performance and eliminates the need for object detection methods to pre-extract regional features, thereby naturally addressing two major limitations of existing GMNER methods. 2) The introduction of entity expansion expressions and a Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG), which enables RiVEG to inherit the Visual Entailment and Visual Grounding capabilities of any current or prospective multimodal pretraining model. Extensive experiments demonstrate that RiVEG outperforms state-of-the-art methods on the existing GMNER dataset, with absolute leads of 10.65%, 6.21%, and 8.83% on the three subtasks, respectively.
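The MNER-VE-VG reformulation described above can be pictured as a staged pipeline. The following is only a minimal sketch under stated assumptions, not the RiVEG implementation: the function names (`gmner_pipeline`, `mner`, `expand`, `entails`, `ground`), the box format, and the choice to let the VE stage decide whether an entity is groundable are all illustrative assumptions, and the toy stand-ins at the bottom are placeholders for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed box format (x1, y1, x2, y2); the abstract does not specify one.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedEntity:
    name: str
    entity_type: str
    box: Optional[Box]  # None marks an entity treated as ungroundable


def gmner_pipeline(
    text: str,
    image: object,
    mner: Callable[[str, object], List[Tuple[str, str]]],
    expand: Callable[[str, str, str], str],
    entails: Callable[[object, str], bool],
    ground: Callable[[object, str], Box],
) -> List[GroundedEntity]:
    """Chain four hypothetical stages: MNER, LLM expansion, VE, VG."""
    results = []
    # Stage 1 (MNER): extract (entity, type) pairs from the image-text pair.
    for name, entity_type in mner(text, image):
        # Stage 2 (LLM bridge): turn the fine-grained named entity into a
        # coarse-grained entity expansion expression.
        expression = expand(text, name, entity_type)
        # Stage 3 (VE): assumed here to decide whether the image supports
        # the expression, i.e. whether the entity is groundable.
        if entails(image, expression):
            # Stage 4 (VG): localize the expression in the image.
            box = ground(image, expression)
        else:
            box = None
        results.append(GroundedEntity(name, entity_type, box))
    return results


# Toy stand-ins so the sketch runs end to end; real systems would plug in
# trained MNER, LLM, VE, and VG models here.
def toy_mner(text, image):
    return [("Kevin Durant", "PER"), ("Oklahoma", "LOC")]


def toy_expand(text, name, entity_type):
    return f"{name} ({entity_type})"


def toy_entails(image, expression):
    # Pretend only person entities are visible in the image.
    return "(PER)" in expression


def toy_ground(image, expression):
    return (10.0, 20.0, 110.0, 220.0)


demo = gmner_pipeline(
    "Kevin Durant enters Oklahoma", None,
    toy_mner, toy_expand, toy_entails, toy_ground,
)
```

Passing each stage in as a callable mirrors the abstract's claim that the VE and VG components can be swapped for any current or future pretrained model without changing the surrounding pipeline.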