Evaluating Text-to-Visual Generation with Image-to-Text Generation

Source: arXiv. We open-source our data, model, and code at: https://github.com/linzhiqiu/t2v_metrics ; project page: https://linzhiqiu.github.io/papers/vqascore

Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely used CLIPScore measures the alignment between a (generated) image and a text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that CLIP's text encoder is known to act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across eight image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and higher-order reasoning such as comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
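The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the t2v_metrics API. The function names (`bag_of_words`, `build_question`, `yes_probability`) and the toy answer logits are assumptions made for illustration; a real VQA model would produce logits over its full vocabulary from the image and question. Only the question template and the "probability of Yes" idea come from the abstract.

```python
import math
from collections import Counter
from typing import Mapping

# Question template quoted from the abstract.
QUESTION_TEMPLATE = "Does this figure show '{text}'?"


def bag_of_words(text: str) -> Counter:
    """Order-free word counts: a toy stand-in for a 'bag of words' encoder."""
    return Counter(text.lower().split())


def build_question(text: str) -> str:
    """Wrap a prompt in the yes/no question that the VQA model is asked."""
    return QUESTION_TEMPLATE.format(text=text)


def yes_probability(answer_logits: Mapping[str, float], yes_token: str = "Yes") -> float:
    """Softmax probability of the 'Yes' token given per-token answer logits.

    `answer_logits` is a hypothetical mapping from answer token to logit;
    in practice it would come from a VQA model conditioned on the image
    and the question.
    """
    peak = max(answer_logits.values())  # subtract the max for numerical stability
    total = sum(math.exp(value - peak) for value in answer_logits.values())
    return math.exp(answer_logits[yes_token] - peak) / total


if __name__ == "__main__":
    a = "the horse is eating the grass"
    b = "the grass is eating the horse"
    # Word order is lost, so both prompts get the same representation.
    print(bag_of_words(a) == bag_of_words(b))
    print(build_question(a))
    # Made-up logits standing in for a model's output on one image.
    print(yes_probability({"Yes": 2.0, "No": 0.0}))
```

The first check shows why an order-insensitive text representation cannot separate the two horse and grass prompts, while the score in the last line depends on the full question string and therefore on word order, provided the underlying model is sensitive to it.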

