Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types, and their corresponding visual regions. The GMNER task exhibits two challenging properties: 1) The weak correlation between image-text pairs in social media results in a significant portion of named entities being ungroundable. 2) There is a distinction between the coarse-grained referring expressions commonly used in related tasks (e.g., phrase localization and referring expression comprehension) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge. This reformulation brings two benefits: 1) It maintains optimal MNER performance and eliminates the need to employ object detection methods to pre-extract regional features, thereby naturally addressing two major limitations of existing GMNER methods. 2) The introduction of entity expansion expressions and a Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG), enabling RiVEG to effortlessly inherit the visual entailment and visual grounding capabilities of any current or future multimodal pretraining model. Extensive experiments demonstrate that RiVEG outperforms state-of-the-art methods on the existing GMNER dataset, achieving absolute leads of 10.65%, 6.21%, and 8.83% on the three subtasks.