Visual Word Sense Disambiguation (VWSD) is a novel and challenging task that lies between linguistic sense disambiguation and fine-grained multimodal retrieval. Recent advances in visiolinguistic (VL) transformers offer off-the-shelf implementations with encouraging results, which we argue can nonetheless be further improved. To this end, we propose knowledge-enhancement techniques that improve the retrieval performance of VL transformers by using Large Language Models (LLMs) as Knowledge Bases. More specifically, knowledge stored in LLMs is retrieved with appropriate prompts in a zero-shot manner, which improves performance. Moreover, we convert VWSD to a purely textual question-answering (QA) problem by treating generated image captions as multiple-choice candidate answers. Zero-shot and few-shot prompting strategies are leveraged to explore the potential of this transformation, while Chain-of-Thought (CoT) prompting in the zero-shot setting reveals the internal reasoning steps an LLM follows to select the appropriate candidate. Overall, our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve VWSD.