Despite significant progress in generative AI, comprehensive evaluation remains challenging because effective metrics and standardized benchmarks are lacking. For instance, the widely used CLIPScore measures the alignment between a (generated) image and a text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that CLIP's text encoder is known to act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across eight image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, and relationships, as well as higher-order reasoning such as comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
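The two ideas above, the "bag of words" failure and the "Yes"-probability score, can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation. The names `bag_of_words`, `build_question`, and `vqa_score` are hypothetical, and the answer logits are made-up numbers standing in for the output of a real VQA model. The `Counter` comparison only shows why an order-insensitive representation cannot separate the two horse/grass prompts; it does not model CLIP's actual encoder.

```python
import math
from collections import Counter
from typing import Dict

# Question template quoted from the text; {text} is filled with the prompt.
QUESTION_TEMPLATE = "Does this figure show '{text}'?"


def bag_of_words(prompt: str) -> Counter:
    """Order-insensitive word counts (illustration only, not CLIP itself)."""
    return Counter(prompt.lower().split())


def build_question(text: str) -> str:
    """Wrap a text prompt in the yes/no question given to the VQA model."""
    return QUESTION_TEMPLATE.format(text=text)


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}


def vqa_score(answer_logits: Dict[str, float]) -> float:
    """Alignment score: probability assigned to the answer "Yes"."""
    return softmax(answer_logits)["Yes"]


prompt_a = "the horse is eating the grass"
prompt_b = "the grass is eating the horse"

# Both prompts contain the same words, so word counts cannot tell them apart.
same_bag = bag_of_words(prompt_a) == bag_of_words(prompt_b)

# Hypothetical answer logits for one image of a horse grazing (assumed values;
# a real VQA model would produce these from the image and the question).
logits_a = {"Yes": 3.0, "No": 0.5, "Maybe": -1.0}
logits_b = {"Yes": -0.5, "No": 2.5, "Maybe": -1.0}

print(build_question(prompt_a))
print("same bag of words:", same_bag)
print("score for prompt_a:", round(vqa_score(logits_a), 3))
print("score for prompt_b:", round(vqa_score(logits_b), 3))
```

With a real model, the logits would come from the answer decoder conditioned on the image and the question, and the score would be read off the first generated answer token. The small three-token vocabulary here is a simplification for readability.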