The goal of visual word sense disambiguation is to find the image that best matches the provided description of the word's meaning. It is a challenging problem, requiring approaches that combine language and image understanding. In this paper, we present our submission to SemEval 2023 visual word sense disambiguation shared task. The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches. We build a classifier based on the CLIP model, whose results are enriched with additional information retrieved from Wikipedia and lexical databases. Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.
翻译:视觉词义消歧的目标是找到与给定词语含义描述最匹配的图像。这是一个具有挑战性的问题,需要融合语言与图像理解的方法。本文介绍了我们在SemEval 2023视觉词义消歧共享任务中的提交方案。所提出的系统集成了多模态嵌入、学习排序方法以及基于知识的方法。我们构建了一个基于CLIP模型的分类器,其结果通过从维基百科和词汇数据库中检索的额外信息得到增强。我们的解决方案在多语言任务中排名第三,并在三个语言子任务之一的波斯语赛道上获得冠军。