Evaluating Text-to-Visual Generation with Image-to-Text Generation

Source: arXiv. We open-source our data, model, and code at: https://github.com/linzhiqiu/t2v_metrics ; Project page: https://linzhiqiu.github.io/papers/vqascore

Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
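The scoring rule described above (take the probability of a "Yes" answer to a templated question) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `build_question` and `yes_probability` are hypothetical, the logits are made-up numbers, and a real VQA model would produce logits over its full vocabulary conditioned on the image rather than over two hand-picked tokens.

```python
import math

# Question template quoted in the abstract.
QUESTION_TEMPLATE = "Does this figure show '{text}'?"


def build_question(text: str) -> str:
    """Wrap a text prompt in the yes/no question used for scoring."""
    return QUESTION_TEMPLATE.format(text=text)


def yes_probability(answer_logits: dict[str, float], yes_token: str = "Yes") -> float:
    """Softmax over candidate answer-token logits; return P(yes_token).

    `answer_logits` is a stand-in (assumption) for the logits a VQA model
    would assign to its first answer token given the image and question.
    """
    max_logit = max(answer_logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(logit - max_logit) for tok, logit in answer_logits.items()}
    return exps[yes_token] / sum(exps.values())


# Made-up logits for one image of a horse eating grass, scored against
# the matching prompt and the word-swapped prompt.
logits_match = {"Yes": 2.0, "No": 0.0}
logits_swapped = {"Yes": -1.0, "No": 1.0}

print(build_question("the horse is eating the grass"))
print("score (matching prompt):", yes_probability(logits_match))
print("score (swapped prompt): ", yes_probability(logits_swapped))
```

In this toy setup the matching prompt receives the higher score only because the illustrative logits were chosen that way. The point is the mechanics: the alignment score is a single probability read off the answer distribution, so it depends on the whole question rather than on an unordered set of words.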

