Multimodal Entity Linking (MEL) is a crucial task that aims to link ambiguous mentions in multimodal contexts to their referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods focus heavily on complex mechanisms and extensive model tuning to model multimodal interaction on specific datasets. However, these methods overcomplicate the MEL task and overlook visual semantic information, making them costly and hard to scale. Moreover, they cannot resolve issues such as textual ambiguity, redundancy, and noisy images, which severely degrade performance. Fortunately, the advent of Large Language Models (LLMs) with robust capabilities in text understanding and reasoning, particularly Multimodal Large Language Models (MLLMs) that can process multimodal inputs, provides new insights into addressing this challenge. However, designing a universally applicable LLM-based MEL approach remains a pressing challenge. To this end, we propose UniMEL, a unified framework that establishes a new paradigm for processing multimodal entity linking tasks with LLMs. In this framework, we employ LLMs to augment the representations of mentions and entities individually by integrating textual and visual information and refining the textual information. Subsequently, we use an embedding-based method to retrieve and re-rank candidate entities. Then, with only ~0.26% of the model parameters fine-tuned, the LLM makes the final selection from the candidate entities. Extensive experiments on three public benchmark datasets demonstrate that our solution achieves state-of-the-art performance, and ablation studies verify the effectiveness of each module. Our code is available at https://anonymous.4open.science/r/UniMEL/.
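The embedding-based retrieval and re-ranking step mentioned above can be sketched as follows. This is a minimal, hypothetical illustration only: it uses a toy bag-of-words embedding and cosine similarity in place of the neural encoder the paper's pipeline would use, and the function names (`embed`, `retrieve_candidates`) and toy knowledge base are assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real MEL system would use a
    # learned multimodal or text encoder here.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_candidates(mention, entities, k=3):
    # Rank knowledge-base entity descriptions by similarity to the
    # mention context, keeping the top-k as candidates for the LLM
    # to make the final selection from.
    m = embed(mention)
    scored = sorted(entities, key=lambda e: cosine(m, embed(e)), reverse=True)
    return scored[:k]

# Hypothetical toy knowledge base of entity descriptions.
kb = [
    "Paris France capital city",
    "Paris Hilton celebrity",
    "Paris Texas city",
]
top = retrieve_candidates("Paris the capital of France", kb, k=2)
```

In the full framework, these top-k candidates would then be passed to the fine-tuned LLM for the final linking decision.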