Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to their referent entities in a knowledge base. Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters, which can be prohibitively costly and difficult to scale in the era of Large Language Models (LLMs). In this work, we propose GEMEL, a Generative Multimodal Entity Linking framework based on LLMs, which directly generates target entity names. We keep the vision and language models frozen and train only a feature mapper to enable cross-modality interactions. To adapt LLMs to the MEL task, we leverage their in-context learning capability by retrieving multimodal instances as demonstrations. Extensive experiments show that, with only ~0.3% of the model parameters fine-tuned, GEMEL achieves state-of-the-art results on two well-established MEL datasets (accuracy gains of 7.7% on WikiDiverse and 8.8% on WikiMEL). The performance gain stems from mitigating the popularity bias of LLM predictions and disambiguating less common entities effectively. Further analysis verifies the generality and scalability of GEMEL. Our framework is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution for utilizing LLMs in the MEL task.
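The two mechanisms named in the abstract, a small trainable feature mapper sitting between frozen models and retrieval of demonstrations for in-context learning, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the GEMEL implementation: the mapper is taken to be a single linear projection that turns one image feature into a few prefix vectors in the language model's embedding space, demonstrations are ranked by cosine similarity, and all function names, dimensions, and parameter counts are hypothetical.

```python
import math
import random


def make_mapper(d_img, d_llm, n_prefix, seed=0):
    """Initialise a linear mapper (assumed form) with n_prefix * d_llm rows of d_img weights.

    These weights stand in for the only trainable parameters; the vision
    encoder and the language model are treated as frozen and are not modelled.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_img)]
            for _ in range(n_prefix * d_llm)]


def map_image_feature(weights, image_feature, d_llm):
    """Project one image feature into a list of prefix vectors of size d_llm."""
    flat = [sum(w * x for w, x in zip(row, image_feature)) for row in weights]
    return [flat[i:i + d_llm] for i in range(0, len(flat), d_llm)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_demonstrations(query_vec, pool, k):
    """Return the k pool instances most similar to the query, best first.

    Cosine similarity is an assumed ranking criterion, not one stated in the text.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return ranked[:k]


def trainable_fraction(n_mapper_params, n_frozen_params):
    """Share of all parameters that receive gradient updates."""
    return n_mapper_params / (n_mapper_params + n_frozen_params)


if __name__ == "__main__":
    # Toy sizes chosen only so the example runs instantly.
    weights = make_mapper(d_img=4, d_llm=3, n_prefix=2)
    prefix = map_image_feature(weights, [1.0, 0.0, 0.0, 0.0], d_llm=3)
    print(len(prefix), "prefix vectors of size", len(prefix[0]))

    pool = [
        {"name": "a", "vec": [1.0, 0.0]},
        {"name": "b", "vec": [0.0, 1.0]},
        {"name": "c", "vec": [1.0, 1.0]},
    ]
    demos = retrieve_demonstrations([1.0, 0.1], pool, k=2)
    print([d["name"] for d in demos])

    # Hypothetical counts that happen to give a 0.3% trainable share.
    print(trainable_fraction(3, 997))
```

In this sketch the prefix vectors would be prepended to the text embeddings of the prompt, and the retrieved instances would be formatted as demonstrations ahead of the query mention. Because only the mapper weights are trained, the trainable share stays small no matter how large the frozen models are.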