Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to their referent entities in a knowledge base (e.g., Wikipedia). Prior MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters, which can be prohibitively costly and difficult to scale in the era of Large Language Models (LLMs). In this work, we propose GEMEL, a simple yet effective Generative Multimodal Entity Linking method, which leverages the capabilities LLMs acquire from large-scale pre-training to directly generate target entity names. We keep the vision and language models frozen and train only a linear layer to enable cross-modality interactions. To adapt LLMs to the MEL task, we take advantage of their emergent in-context learning (ICL) capability by retrieving multimodal instances as demonstrations. Extensive experiments show that with only ~0.3% of the model parameters fine-tuned, GEMEL achieves state-of-the-art results on two well-established MEL datasets (accuracy gains of 4.1% on WikiDiverse and 15.4% on WikiMEL). Our approach is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution for utilizing LLMs in the MEL task.
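The retrieval of demonstrations for in-context learning can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity ranking, the prompt template, the `<image>` placeholder (standing in for wherever projected visual features would be spliced in), and all names such as `retrieve_demonstrations` and `build_prompt` are assumptions made for illustration, and the data is toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_demonstrations(query_vec, pool, k=2):
    """Return the k pool instances most similar to the query (assumed similarity: cosine)."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(demos, query):
    """Assemble a hypothetical ICL prompt: solved demonstrations first, then the open query."""
    lines = [
        f"<image> {ex['context']} Mention: {ex['mention']} -> Entity: {ex['entity']}"
        for ex in demos
    ]
    lines.append(f"<image> {query['context']} Mention: {query['mention']} -> Entity:")
    return "\n".join(lines)

# Toy pool of labeled instances; vectors stand in for precomputed multimodal features.
pool = [
    {"vec": [1.0, 0.0], "context": "ctx A", "mention": "m A", "entity": "Entity A"},
    {"vec": [0.0, 1.0], "context": "ctx B", "mention": "m B", "entity": "Entity B"},
    {"vec": [0.9, 0.1], "context": "ctx C", "mention": "m C", "entity": "Entity C"},
]
query = {"vec": [1.0, 0.05], "context": "ctx Q", "mention": "m Q"}

demos = retrieve_demonstrations(query["vec"], pool, k=2)
prompt = build_prompt(demos, query)
print(prompt)
```

In a setup like the one the abstract describes, a frozen LLM would be asked to continue such a prompt with the target entity name; how the generated name is constrained to valid knowledge-base entries is not specified here and is left out of the sketch.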