Language models are increasingly incorporated as components in larger AI systems, for purposes ranging from prompt optimization to automatic evaluation. In this work, we analyze the construct validity of four recent, commonly used text-to-image consistency metrics (CLIPScore, TIFA, VPEval, and DSG), each of which relies on language models and/or VQA models as components. We define construct validity for text-image consistency metrics as a set of desiderata that such metrics should satisfy, and find that no tested metric meets all of them. First, the metrics are insufficiently sensitive to both linguistic and visual properties. Next, while TIFA, VPEval, and DSG each contribute novel information beyond CLIPScore, they also correlate highly with one another. Ablating different components of these metrics, we find that not all model components are strictly necessary, itself a symptom of insufficient sensitivity to visual information. Finally, we show that all three VQA-based metrics likely rely on familiar text shortcuts (such as yes-bias in QA), calling into question their aptitude as quantitative evaluations of model performance.
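At a high level, CLIPScore is a rescaled cosine similarity between CLIP's text and image embeddings, clipped at zero. A minimal sketch of that computation, using stand-in numpy vectors rather than real CLIP encoder outputs (the embeddings and the rescaling weight here are illustrative assumptions, not outputs of the actual model):

```python
import numpy as np

def clipscore(text_emb: np.ndarray, image_emb: np.ndarray, w: float = 2.5) -> float:
    """Rescaled, zero-clipped cosine similarity between a text embedding
    and an image embedding (the general shape of the CLIPScore formula)."""
    cos = float(np.dot(text_emb, image_emb) /
                (np.linalg.norm(text_emb) * np.linalg.norm(image_emb)))
    return w * max(cos, 0.0)

# Stand-in embeddings; a real CLIPScore uses CLIP's encoders.
aligned = clipscore(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
orthogonal = clipscore(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(aligned > orthogonal)  # a better-aligned pair scores higher
```

Because the score is a single similarity number, it offers no breakdown of *which* part of the caption an image fails to depict, which is part of what the QA-based metrics try to address.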
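The yes-bias shortcut mentioned above can be made concrete with a toy version of the QA-based scoring recipe (TIFA, VPEval, and DSG differ in detail, but all score an image by the fraction of caption-derived questions a VQA model answers correctly). The questions and the degenerate answerer below are hypothetical, constructed only to show the failure mode:

```python
def vqa_metric(questions, answer_fn):
    """Score = fraction of (question, gold answer) pairs for which the
    answering function returns the gold answer."""
    correct = sum(answer_fn(q) == gold for q, gold in questions)
    return correct / len(questions)

# Questions generated from a caption tend to expect "yes"
# ("Is there a dog?", "Is the dog brown?", ...).
questions = [("Is there a dog?", "yes"),
             ("Is the dog brown?", "yes"),
             ("Is the dog on a sofa?", "yes")]

# A degenerate answerer that never inspects the image still scores 1.0,
# which is the text-shortcut concern: a yes-biased VQA model inflates
# scores without using visual information.
always_yes = lambda q: "yes"
print(vqa_metric(questions, always_yes))  # 1.0
```

A balanced question set (mixing expected "yes" and "no" answers) is one obvious mitigation, since it makes a constant-answer strategy score at chance rather than at ceiling.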