Ambiguity poses a persistent challenge in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous text and candidate images into a shared multimodal space. We enrich textual embeddings with a dual-channel ensemble of semantic and photo-based prompts augmented with WordNet synonyms, and refine image embeddings through robust test-time augmentations. We then use cosine similarity to select the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.
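To make the ranking step concrete, the following is a minimal sketch of how a CLIP-based VWSD pipeline of this kind can score candidate images: a dual-channel prompt ensemble (semantic and photo-style templates, expanded with WordNet synonyms) is encoded and averaged into a single text embedding, and candidates are ranked by cosine similarity. The model checkpoint, prompt templates, and helper names below are illustrative assumptions, not the exact configuration described in the abstract.

```python
# Hypothetical sketch of CLIP-based VWSD ranking (not the authors' exact code).
# Assumes: Hugging Face transformers CLIP, NLTK WordNet, and PIL for candidate images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from nltk.corpus import wordnet as wn

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()


def build_prompts(target: str, context: str) -> list[str]:
    """Dual-channel prompts: a semantic channel and a photo-based channel,
    each expanded with WordNet synonyms of the target word (illustrative templates)."""
    synonyms = {l.name().replace("_", " ") for s in wn.synsets(target) for l in s.lemmas()}
    synonyms.add(target)
    prompts = []
    for syn in sorted(synonyms):
        prompts.append(f"{context}, also known as {syn}")   # semantic channel
        prompts.append(f"a photo of {context} ({syn})")     # photo-based channel
    return prompts


@torch.no_grad()
def rank_candidates(target: str, context: str, image_paths: list[str]) -> list[tuple[str, float]]:
    # Encode the prompt ensemble and average the normalized embeddings into one text vector.
    text_inputs = processor(text=build_prompts(target, context), return_tensors="pt", padding=True)
    text_emb = torch.nn.functional.normalize(model.get_text_features(**text_inputs), dim=-1)
    text_emb = torch.nn.functional.normalize(text_emb.mean(dim=0, keepdim=True), dim=-1)

    # Encode candidate images; test-time augmentation would average several views here.
    images = [Image.open(p).convert("RGB") for p in image_paths]
    image_inputs = processor(images=images, return_tensors="pt")
    img_emb = torch.nn.functional.normalize(model.get_image_features(**image_inputs), dim=-1)

    # Cosine similarity between the ensembled text embedding and each candidate image.
    scores = (img_emb @ text_emb.T).squeeze(-1)
    order = scores.argsort(descending=True)
    return [(image_paths[i], scores[i].item()) for i in order]
```

Averaging normalized prompt embeddings before the final similarity keeps inference to a single image-text comparison per candidate, which is consistent with the low-latency behavior the ablations attribute to dual-channel prompting.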