Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, as the complexity of the given text input increases, state-of-the-art diffusion models may still fail to generate images that accurately convey the semantics of the prompt. Furthermore, it has been observed that such misalignments often go undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach to both the evaluation and the improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which, given a complex prompt, decomposes it into a set of disjoint assertions. The alignment of each assertion with the generated images is then measured using a VQA model. Finally, the alignment scores for the different assertions are combined a posteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric shows significantly higher correlation with human ratings than the traditional CLIP and BLIP scores. Furthermore, we find that the assertion-level alignment scores provide useful feedback, which can then be used in a simple iterative procedure to gradually increase the expression of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses the previous state of the art by 8.7% in overall text-to-image alignment accuracy. The project page for our paper is available at https://1jsingh.github.io/divide-evaluate-and-refine
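
The decompose, evaluate, and combine steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `da_score` and `weakest_assertion`, the toy `stub_vqa_yes_prob` scorer, and the use of an unweighted mean to combine assertion scores are all assumptions made for illustration, since the abstract does not specify the combination rule or the VQA model. In practice the scorer would wrap a real VQA model that returns the probability of answering "yes" to a question about the assertion.

```python
from typing import Callable, Dict, List, Set, Tuple


def da_score(
    image: object,
    assertions: List[str],
    vqa_yes_prob: Callable[[object, str], float],
) -> Tuple[float, Dict[str, float]]:
    """Score each assertion with a VQA-style scorer, then combine the scores.

    Assumption: the combination is a plain mean; the abstract only says the
    assertion scores are combined a posteriori.
    """
    if not assertions:
        raise ValueError("at least one assertion is required")
    per_assertion = {a: vqa_yes_prob(image, a) for a in assertions}
    overall = sum(per_assertion.values()) / len(per_assertion)
    return overall, per_assertion


def weakest_assertion(per_assertion: Dict[str, float]) -> str:
    """Return the assertion with the lowest alignment score.

    An iterative refinement loop could use this as feedback to decide which
    assertion to emphasize in the next generation round.
    """
    return min(per_assertion, key=per_assertion.get)


def stub_vqa_yes_prob(image: Set[str], assertion: str) -> float:
    """Hypothetical stand-in for a VQA model.

    The 'image' here is just a set of facts it depicts, so the scorer
    returns 1.0 when the assertion is depicted and 0.0 otherwise.
    """
    return 1.0 if assertion in image else 0.0


# Toy usage: the prompt is assumed to have been decomposed already.
assertions = ["a dog", "a red ball", "a beach"]
toy_image = {"a dog", "a beach"}  # the red ball is missing from the image

overall, per_assertion = da_score(toy_image, assertions, stub_vqa_yes_prob)
target = weakest_assertion(per_assertion)
```

In this toy case two of the three assertions are satisfied, so the overall score is 2/3 and the weakest assertion is "a red ball". A refinement procedure in the spirit of the abstract would regenerate the image with more emphasis on that assertion and then rescore.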
