Language models are increasingly incorporated as components in larger AI systems, for purposes ranging from prompt optimization to automatic evaluation. In this work, we analyze the construct validity of four recent, commonly used text-to-image consistency metrics (CLIPScore, TIFA, VPEval, and DSG), each of which relies on language models and/or VQA models as components. We define construct validity for text-image consistency metrics as a set of desiderata that such metrics should satisfy, and find that no tested metric meets all of them. First, the metrics are insufficiently sensitive to both linguistic and visual properties. Next, while TIFA, VPEval, and DSG each contribute novel information beyond CLIPScore, they also correlate highly with one another. Ablating different components of these metrics, we find that not all model components are strictly necessary, itself a symptom of insufficient sensitivity to visual information. Finally, we show that all three VQA-based metrics likely rely on familiar text shortcuts (such as yes-bias in QA), calling into question their aptitude as quantitative evaluations of model performance.
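At a high level, CLIPScore is a rescaled cosine similarity between CLIP's text and image embeddings, clipped at zero. A minimal sketch of that computation, using stand-in numpy vectors rather than real CLIP encoder outputs (the embeddings and the rescaling weight here are illustrative assumptions, not outputs of the actual model):

```python
import numpy as np

def clipscore(text_emb: np.ndarray, image_emb: np.ndarray, w: float = 2.5) -> float:
    """Rescaled, zero-clipped cosine similarity between a text embedding
    and an image embedding (the general shape of the CLIPScore formula)."""
    cos = float(np.dot(text_emb, image_emb) /
                (np.linalg.norm(text_emb) * np.linalg.norm(image_emb)))
    return w * max(cos, 0.0)

# Stand-in embeddings; a real CLIPScore uses CLIP's encoders.
aligned = clipscore(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
orthogonal = clipscore(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(aligned > orthogonal)  # a better-aligned pair scores higher
```

Because the score is a single similarity number, it offers no breakdown of *which* part of the caption an image fails to depict, which is part of what the QA-based metrics try to address.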
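The yes-bias shortcut mentioned above can be made concrete with a toy version of the QA-based scoring recipe (TIFA, VPEval, and DSG differ in detail, but all score an image by the fraction of caption-derived questions a VQA model answers correctly). The questions and the degenerate answerer below are hypothetical, constructed only to show the failure mode:

```python
def vqa_metric(questions, answer_fn):
    """Score = fraction of (question, gold answer) pairs for which the
    answering function returns the gold answer."""
    correct = sum(answer_fn(q) == gold for q, gold in questions)
    return correct / len(questions)

# Questions generated from a caption tend to expect "yes"
# ("Is there a dog?", "Is the dog brown?", ...).
questions = [("Is there a dog?", "yes"),
             ("Is the dog brown?", "yes"),
             ("Is the dog on a sofa?", "yes")]

# A degenerate answerer that never inspects the image still scores 1.0,
# which is the text-shortcut concern: a yes-biased VQA model inflates
# scores without using visual information.
always_yes = lambda q: "yes"
print(vqa_metric(questions, always_yes))  # 1.0
```

A balanced question set (mixing expected "yes" and "no" answers) is one obvious mitigation, since it makes a constant-answer strategy score at chance rather than at ceiling.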