Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely used CLIPScore measures the alignment between a (generated) image and a text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that CLIP's text encoder is known to act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across eight image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and higher-order reasoning such as comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
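As a rough illustration of the scoring rule described above, the following minimal Python sketch formats the question from the template and turns a set of answer logits into the probability of "Yes". It is not the authors' implementation: the helper names `build_question` and `yes_probability` are hypothetical, the two-token answer vocabulary is a simplifying assumption (a real VQA model would normalize over its full vocabulary), and the logit values are made up for the example.

```python
import math
from typing import Mapping

# Question template quoted in the abstract.
QUESTION_TEMPLATE = "Does this figure show '{text}'?"


def build_question(text: str) -> str:
    """Fill the prompt text into the yes/no question template."""
    return QUESTION_TEMPLATE.format(text=text)


def yes_probability(answer_logits: Mapping[str, float], yes_token: str = "Yes") -> float:
    """Return the softmax probability assigned to `yes_token`.

    `answer_logits` stands in for the logits a VQA model would produce for
    the answer given (image, question). Here it is a toy mapping from
    candidate answers to logits (an assumption for illustration only).
    """
    max_logit = max(answer_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(logit - max_logit) for tok, logit in answer_logits.items()}
    return exps[yes_token] / sum(exps.values())


if __name__ == "__main__":
    question = build_question("the horse is eating the grass")
    print(question)

    # Made-up logits: a model that favors "Yes" for this image-question pair.
    toy_logits = {"Yes": 2.0, "No": 0.0}
    print(f"alignment score = {yes_probability(toy_logits):.4f}")
```

In practice the logits would come from a VQA model conditioned on both the image and the question, which is what lets the score distinguish "the horse is eating the grass" from "the grass is eating the horse".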