We describe the University of Alberta team's systems for the SemEval-2023 Visual Word Sense Disambiguation (V-WSD) task. We present a novel algorithm that leverages glosses retrieved from BabelNet, in combination with text and image encoders. We also compare language-specific encoders against English encoders applied to translated texts. Because the contexts in the task datasets are extremely short, we experiment with augmenting them with descriptions generated by a language model, which yields substantial improvements in accuracy. We further describe and evaluate additional V-WSD methods that use image generation and text-conditioned image segmentation. Our official submission ranks 18th out of 56 teams, and some of our unofficial results surpass the official ones. Our code is publicly available at https://github.com/UAlberta-NLP/v-wsd.
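To make the general idea of combining context, glosses, and candidate images concrete, the following is a minimal sketch and not the algorithm from our submission. It assumes that a text encoder and an image encoder have already mapped the context, the retrieved glosses, and the candidate images into a shared embedding space. The function names, the additive scoring rule, the max over glosses, and the `gloss_weight` parameter are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_candidates(context_emb, gloss_embs, image_embs, gloss_weight=0.5):
    """Rank candidate images for one target word (hypothetical scoring rule).

    context_emb: (d,) embedding of the short textual context.
    gloss_embs:  (g, d) embeddings of glosses retrieved for the target word.
    image_embs:  (n, d) embeddings of the candidate images.
    gloss_weight: assumed interpolation weight, not a value from the paper.
    """
    context = l2_normalize(context_emb)
    glosses = l2_normalize(gloss_embs)
    images = l2_normalize(image_embs)

    # Cosine similarity between each image and the context.
    context_scores = images @ context
    # Cosine similarity between each image and its best-matching gloss.
    gloss_scores = (images @ glosses.T).max(axis=1)

    scores = context_scores + gloss_weight * gloss_scores
    order = np.argsort(-scores)  # indices of candidates, best first
    return order, scores


# Toy embeddings standing in for real encoder outputs.
context_emb = [1.0, 0.0, 0.0]
gloss_embs = [[0.0, 1.0, 0.0]]
image_embs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

order, scores = rank_candidates(context_emb, gloss_embs, image_embs)
order_no_gloss, _ = rank_candidates(
    context_emb, gloss_embs, image_embs, gloss_weight=0.0
)
```

In this toy setup the gloss term changes which candidate is ranked first, which illustrates why extra sense information can help when the context alone is very short.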