Recent text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from text prompts. However, previous methods fail to align text concepts accurately with the generated images because they lack fine-grained semantic guidance that can diagnose the discrepancy between the two modalities. In this paper, we propose FineRewards, which improves text-image alignment in text-to-image diffusion models by introducing two new fine-grained semantic rewards: the caption reward and the Semantic Segment Anything (SAM) reward. From the global semantic view, the caption reward uses a BLIP-2 model to generate a detailed caption that depicts all important contents of the synthetic image, and then computes the reward score as the similarity between the generated caption and the given prompt. From the local semantic view, the SAM reward segments the generated image into local parts with category labels and scores each part by using a large language model, Vicuna-7B, to measure the likelihood that its category appears in the prompted scene. Additionally, we adopt an assemble reward-ranked learning strategy that integrates multiple reward functions to jointly guide model training. Results from adapting text-to-image models on the MS-COCO benchmark show that the proposed semantic rewards outperform other baseline reward functions by a considerable margin in both visual quality and semantic similarity to the input prompt. Moreover, with the assemble reward-ranked learning strategy, model performance improves further when the proposed semantic rewards are combined with existing image rewards during adaptation.
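As a rough illustration of how several reward functions could jointly rank candidate images, the following minimal sketch scores captions against a prompt and combines the result with a second reward. Everything in it is an assumption made for illustration rather than the method described above: the function names (`caption_similarity`, `min_max`, `rank_candidates`) are hypothetical, bag-of-words cosine similarity stands in for the unspecified caption-prompt similarity, the captions and segment scores are made-up values rather than BLIP-2 or Vicuna-7B outputs, and the min-max normalization with a weighted sum is only one plausible way to assemble rewards.

```python
from collections import Counter
from math import sqrt


def caption_similarity(caption: str, prompt: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for the real similarity measure)."""
    a = Counter(caption.lower().split())
    b = Counter(prompt.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so rewards on different scales are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank_candidates(reward_table: dict[str, list[float]], weights=None) -> list[int]:
    """Return candidate indices sorted from best to worst combined reward.

    Assumption: rewards are combined by a weighted sum of normalized scores.
    """
    names = list(reward_table)
    n = len(reward_table[names[0]])
    weights = weights or {name: 1.0 for name in names}
    combined = [0.0] * n
    for name in names:
        for i, s in enumerate(min_max(reward_table[name])):
            combined[i] += weights[name] * s
    return sorted(range(n), key=lambda i: combined[i], reverse=True)


# Toy usage: three candidate images for one prompt.
prompt = "a dog on a beach"
captions = [  # hypothetical captions, not real BLIP-2 output
    "a dog running on a sandy beach",
    "a cat on a sofa",
    "a beach at sunset",
]
caption_scores = [caption_similarity(c, prompt) for c in captions]
segment_scores = [0.9, 0.2, 0.6]  # made-up local (segment-level) reward values
ranking = rank_candidates({"caption": caption_scores, "segment": segment_scores})
print(ranking)
```

In a reward-ranked setup, the top-ranked candidates for each prompt would then serve as training targets; how many to keep and how to weight each reward are design choices this sketch leaves open.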