Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, state-of-the-art diffusion models may still fail to generate images that accurately convey the semantics of the given prompt as the complexity of the text input increases. Furthermore, it has been observed that such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which, given a complex prompt, decomposes it into a set of disjoint assertions. The alignment of each assertion with the generated images is then measured using a VQA model. Finally, the alignment scores for the different assertions are combined a posteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric correlates significantly better with human ratings than the traditional CLIP and BLIP scores. Furthermore, we also find that the assertion-level alignment scores provide useful feedback, which can then be used in a simple iterative procedure to gradually increase the expression of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses the previous state of the art by 8.7% in overall text-to-image alignment accuracy. The project page for our paper is available at https://1jsingh.github.io/divide-evaluate-and-refine
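
The decompose, score, and combine steps described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `VQAModel` callable signature, the question template, the toy `stub_vqa` stand-in, and in particular the unweighted mean used as the combination rule (the abstract only says the assertion scores are combined a posteriori). The assertions are written by hand rather than produced by an automatic decomposition step.

```python
from typing import Callable, Dict, List

# Hypothetical interface: takes an image and a yes/no question and
# returns the probability of the answer "yes" in [0, 1].
VQAModel = Callable[[object, str], float]


def assertion_scores(image: object, assertions: List[str], vqa: VQAModel) -> Dict[str, float]:
    """Score each assertion separately by asking the VQA model about it."""
    # The question template is an assumption, not the paper's wording.
    return {a: vqa(image, f"Is the following true? {a}") for a in assertions}


def combined_score(scores: Dict[str, float]) -> float:
    """Combine per-assertion scores into one alignment score.

    Assumption: an unweighted mean. The abstract does not state the rule.
    """
    if not scores:
        raise ValueError("at least one assertion is required")
    return sum(scores.values()) / len(scores)


def weakest_assertion(scores: Dict[str, float]) -> str:
    """Return the assertion with the lowest score, i.e. the one an
    iterative refinement step would most plausibly emphasize next."""
    return min(scores, key=scores.get)


def stub_vqa(image: object, question: str) -> float:
    """Toy stand-in for a real VQA model: the 'image' is a dict that maps
    each assertion to a made-up probability."""
    for assertion, prob in image.items():
        if assertion in question:
            return prob
    return 0.0


# Hand-written decomposition of the prompt
# "a red car parked next to a dog under a tree".
assertions = ["there is a red car", "there is a dog", "there is a tree"]
fake_image = {"there is a red car": 0.75, "there is a dog": 0.25, "there is a tree": 0.5}

scores = assertion_scores(fake_image, assertions, stub_vqa)
overall = combined_score(scores)
worst = weakest_assertion(scores)
```

In this toy run the per-assertion scores expose which part of the prompt is poorly expressed (`worst` is the dog assertion), which a single global similarity score would not reveal. With a real model, `stub_vqa` would be replaced by a call that returns the model's probability for "yes".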
