Multimodal Entity Linking (MEL) aims to link ambiguous mentions in multimodal contexts to their referent entities in a multimodal knowledge base. Recent MEL methods share a common framework: they first fuse text and image features through cross-modal interaction to obtain a representation of the mention and of the entity, and then compute the similarity between the two representations to predict the correct entity. However, these methods suffer from two limitations. First, because they fuse text and image features before matching, they cannot fully exploit the fine-grained alignment relations between the mention and the entity. Second, their alignment is static, which degrades performance on complex and diverse data. To address these issues, we propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL. DRIN explicitly models four different types of alignment between a mention and an entity and builds a dynamic Graph Convolutional Network (GCN) that selects the appropriate alignment relations for each input sample. Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
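The idea of scoring several mention-entity alignments separately and weighting them per input sample can be illustrated with a minimal sketch. It is not DRIN itself: a simple softmax gate stands in for the dynamic GCN, and the four pairings used here (mention text or image against entity text or image) are an assumption, since the abstract does not enumerate the four alignment types. The names `alignment_scores`, `dynamic_match`, and `gate_w` are hypothetical, and random vectors stand in for encoder outputs.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def alignment_scores(m_text, m_img, e_text, e_img):
    """Four pairwise similarities; this pairing is an assumption."""
    return np.array([
        cosine(m_text, e_text),  # mention text  vs entity text
        cosine(m_text, e_img),   # mention text  vs entity image
        cosine(m_img, e_text),   # mention image vs entity text
        cosine(m_img, e_img),    # mention image vs entity image
    ])

def dynamic_match(m_text, m_img, e_text, e_img, gate_w):
    """Combine the four scores with sample-dependent weights.

    gate_w is a hypothetical (4, 4) parameter matrix that would be
    learned in a real model; here it is a stand-in for the dynamic GCN.
    """
    scores = alignment_scores(m_text, m_img, e_text, e_img)
    weights = softmax(gate_w @ scores)  # weights change with each sample
    return float(weights @ scores), weights

# Random features stand in for the outputs of text and image encoders.
rng = np.random.default_rng(0)
dim = 8
m_text, m_img, e_text, e_img = (rng.normal(size=dim) for _ in range(4))
gate_w = rng.normal(size=(4, 4))

score, weights = dynamic_match(m_text, m_img, e_text, e_img, gate_w)
print("relation weights:", np.round(weights, 3))
print("matching score:", round(score, 3))
```

In contrast to a fuse-then-match pipeline, the four similarities stay separate until the final weighting, so the contribution of each alignment can differ from one mention-entity pair to the next.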