Visual Word Sense Disambiguation (VWSD) is a multi-modal task that aims to select, from a set of candidate images, the one that best captures the target word's meaning within a limited context. In this paper, we propose a multi-modal retrieval framework that makes extensive use of pretrained Vision-Language models, as well as open knowledge bases and datasets. Our system consists of the following key components: (1) Gloss matching: a pretrained bi-encoder model is used to match contexts with the proper senses of the target words; (2) Prompting: matched glosses and other textual information, such as synonyms, are incorporated using a prompting template; (3) Image retrieval: semantically matching images are retrieved from large open datasets using the prompts as queries; (4) Modality fusion: contextual information from different modalities is fused and used for prediction. Although our system does not produce the most competitive results at SemEval-2023 Task 1, it still outperforms nearly half of the participating teams. More importantly, our experiments yield useful insights for the fields of Word Sense Disambiguation (WSD) and multi-modal learning. Our code is available on GitHub.
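The four components listed above can be illustrated with a minimal sketch. It is not the released implementation: the function names (`match_gloss`, `build_prompt`, `fuse_and_rank`), the prompt template, the fusion weight `alpha`, and the toy two-dimensional vectors are all assumptions made for illustration. In a real system the vectors would come from a pretrained bi-encoder and a Vision-Language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_gloss(context_vec, gloss_vecs):
    """Step 1 (gloss matching): index of the gloss most similar to the context."""
    scores = [cosine(context_vec, g) for g in gloss_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def build_prompt(context, gloss, synonyms):
    """Step 2 (prompting): join context, gloss and synonyms.

    The template is hypothetical; the abstract does not specify one.
    """
    parts = [context, gloss]
    if synonyms:
        parts.append("synonyms: " + ", ".join(synonyms))
    return "; ".join(parts)


def fuse_and_rank(prompt_vec, retrieved_vecs, candidate_vecs, alpha=0.5):
    """Steps 3-4 (retrieval and fusion): rank candidate images.

    Each candidate is scored by a weighted sum of its similarity to the
    text prompt and its best similarity to any retrieved reference image.
    The weighted sum is an assumed fusion rule, not the paper's.
    """
    ranked = []
    for idx, cand in enumerate(candidate_vecs):
        text_score = cosine(prompt_vec, cand)
        image_score = max((cosine(r, cand) for r in retrieved_vecs), default=0.0)
        ranked.append((alpha * text_score + (1 - alpha) * image_score, idx))
    ranked.sort(reverse=True)
    return [idx for _, idx in ranked]


# Toy data: every vector and string below is made up for illustration.
glosses = ["a galaxy in the Local Group", "a shrub of the heath family"]
best = match_gloss([0.1, 0.9], [[1.0, 0.0], [0.2, 0.8]])
prompt = build_prompt("andromeda tree", glosses[best], ["Japanese andromeda"])
ranking = fuse_and_rank(
    prompt_vec=[0.2, 0.8],
    retrieved_vecs=[[0.1, 0.9]],
    candidate_vecs=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
)
print(prompt)
print(ranking)
```

Setting `alpha` to 1.0 reduces the sketch to plain text-to-image matching, which makes it easy to check whether the retrieved reference images help at all.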