Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to their referent entities in a knowledge base (e.g., Wikipedia). Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters, which can be prohibitively costly and difficult to scale in the era of Large Language Models (LLMs). In this work, we propose GEMEL, a simple yet effective Generative Multimodal Entity Linking framework based on LLMs, which directly generates target entity names. We keep the vision and language models frozen and train only a feature mapper to enable cross-modality interactions. To adapt LLMs to the MEL task, we take advantage of their emergent in-context learning capability by retrieving multimodal instances as demonstrations. Extensive experiments show that, with only ~0.3% of the model parameters fine-tuned, GEMEL achieves state-of-the-art results on two well-established MEL datasets (accuracy gains of 7.7% on WikiDiverse and 8.8% on WikiMEL). The performance gain stems from mitigating the popularity bias of LLM predictions and from disambiguating less common entities effectively. Further analysis verifies the generality and scalability of GEMEL. Our approach is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution for utilizing LLMs in the MEL task.
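The demonstration-retrieval step described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the feature extractor, or the prompt template, so everything below is an assumption made for illustration: the `Instance` fields, the cosine-similarity ranking, the `<image>` placeholder (standing in for the visual features produced by the trained feature mapper), and the question wording are all hypothetical and are not taken from GEMEL itself.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """A toy multimodal instance. All fields are hypothetical."""
    mention: str
    context: str
    entity: str            # gold entity name; empty for the query
    feature: List[float]   # stand-in for a precomputed multimodal embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_demonstrations(query: Instance, pool: List[Instance], k: int = 2) -> List[Instance]:
    """Return the k pool instances most similar to the query (assumed metric: cosine)."""
    ranked = sorted(pool, key=lambda inst: cosine(query.feature, inst.feature), reverse=True)
    return ranked[:k]


def build_prompt(query: Instance, demos: List[Instance]) -> str:
    """Assemble an in-context prompt: solved demonstrations first, then the open query.

    "<image>" marks where mapped visual features would be placed (assumed template).
    """
    lines = [
        f'<image> {d.context} Which entity does "{d.mention}" refer to? {d.entity}'
        for d in demos
    ]
    lines.append(f'<image> {query.context} Which entity does "{query.mention}" refer to?')
    return "\n".join(lines)


# Toy data with made-up three-dimensional features.
pool = [
    Instance("Jordan", "Jordan dunks during the 1996 finals.", "Michael Jordan", [0.9, 0.1, 0.0]),
    Instance("Jordan", "Jordan borders Israel and Saudi Arabia.", "Jordan", [0.0, 0.2, 0.9]),
    Instance("Apple", "Apple unveiled a new phone.", "Apple Inc.", [0.1, 0.9, 0.1]),
]
query = Instance("Jordan", "Jordan celebrates with the Bulls.", "", [0.8, 0.2, 0.1])

demos = retrieve_demonstrations(query, pool, k=2)
prompt = build_prompt(query, demos)
print(prompt)
```

In this sketch the frozen LLM would receive `prompt` and generate the entity name as a continuation of the final line; no model parameters are touched by the retrieval or prompt assembly themselves.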