Despite their remarkable successes, state-of-the-art language models still struggle to grasp certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA poses a 3-way semantic (in)equivalence task over a triplet of sentences associated with an image, which allows it to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation of 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, the text encoders of VLMs are more sensitive to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation that requires no fine-tuning, and an assessment of LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision-language and unimodal language models, contributing to a deeper understanding of their capabilities. Data and code will be made available at https://github.com/Sri-Harsha/visla_benchmark.
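To make the triplet task concrete, the following is a minimal sketch of how one triplet might be scored from precomputed embeddings. The abstract does not specify the scoring rule, so everything here is an assumption: the triplet is taken to contain two semantically equivalent captions (`p1`, `p2`) and one semantically different caption (`n`), similarity is taken to be cosine similarity, and the function names `text_to_text_correct` and `image_to_text_correct` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_to_text_correct(p1: Vector, p2: Vector, n: Vector) -> bool:
    """Assumed rule: the two equivalent captions must be closer to each
    other than either of them is to the semantically different caption."""
    pos = cosine(p1, p2)
    return pos > cosine(p1, n) and pos > cosine(p2, n)


def image_to_text_correct(image: Vector, p1: Vector, p2: Vector, n: Vector) -> bool:
    """Assumed rule: the image must be closer to both equivalent captions
    than to the semantically different caption."""
    neg = cosine(image, n)
    return cosine(image, p1) > neg and cosine(image, p2) > neg


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely for illustration.
    p1, p2, n = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
    image = [1.0, 0.05]
    print(text_to_text_correct(p1, p2, n))
    print(image_to_text_correct(image, p1, p2, n))
```

Because both checks operate only on embeddings, the same triplet can be applied to a VLM (image and text encoders) or to a ULM (text encoder only) without fine-tuning, which is the kind of unified, off-the-shelf evaluation the abstract describes.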