Boosting Text-to-Image Diffusion Models with Fine-Grained Semantic Rewards

Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from text prompts. However, previous methods fail to align text concepts accurately with generated images because they lack fine-grained semantic guidance that can diagnose the modality discrepancy. In this paper, we propose FineRewards, which improves the alignment between text and images in text-to-image diffusion models by introducing two new fine-grained semantic rewards: the caption reward and the Semantic Segment Anything (SAM) reward. From the global semantic view, the caption reward uses a BLIP-2 model to generate a detailed caption that depicts all important contents of the synthetic image, and then computes the reward score by measuring the similarity between the generated caption and the given prompt. From the local semantic view, the SAM reward segments the generated image into local parts with category labels, and scores the segmented parts by using a large language model, Vicuna-7B, to measure the likelihood of each category appearing in the prompted scene. Additionally, we adopt an assemble reward-ranked learning strategy that integrates multiple reward functions to jointly guide model training. Results from adapting text-to-image models on the MS-COCO benchmark show that the proposed semantic rewards outperform other baseline reward functions by a considerable margin on both visual quality and semantic similarity with the input prompt. Moreover, with the assemble reward-ranked learning strategy, model performance improves further when the proposed semantic rewards are combined with existing image rewards.
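To make the idea of ranking samples under several rewards concrete, the following is a minimal sketch, not the paper's implementation. It assumes each generated image is already summarized by a caption and a list of segment labels (stand-ins for BLIP-2 and segmentation outputs), uses word-set Jaccard overlap as a hypothetical similarity measure, uses a hand-written likelihood table in place of scores from a language model, and combines rewards by summing per-reward ranks. All names (`Candidate`, `caption_reward`, `segment_reward`, `select_by_combined_rank`) and the rank-sum rule are illustrative assumptions; the abstract does not specify how the rewards are aggregated.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable, Dict, List, Sequence


@dataclass
class Candidate:
    """One generated image, described only by outputs of upstream models."""
    name: str
    caption: str               # stand-in for a caption produced by a captioning model
    segment_labels: List[str]  # stand-in for category labels of segmented parts


def caption_reward(prompt: str, cand: Candidate) -> float:
    """Toy global reward: Jaccard overlap of word sets (assumed similarity)."""
    p = set(prompt.lower().split())
    c = set(cand.caption.lower().split())
    union = p | c
    return len(p & c) / len(union) if union else 0.0


def segment_reward(likelihood: Dict[str, float], cand: Candidate) -> float:
    """Toy local reward: mean likelihood of the segment labels.

    The likelihood table is a hypothetical stand-in for LLM-provided scores.
    """
    if not cand.segment_labels:
        return 0.0
    total = sum(likelihood.get(label, 0.0) for label in cand.segment_labels)
    return total / len(cand.segment_labels)


def rank_positions(scores: Sequence[float]) -> List[int]:
    """Return the rank of each item under one reward (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def select_by_combined_rank(
    candidates: Sequence[Candidate],
    reward_fns: Sequence[Callable[[Candidate], float]],
    keep: int,
) -> List[Candidate]:
    """Keep the `keep` candidates with the smallest sum of per-reward ranks."""
    per_reward = [rank_positions([fn(c) for c in candidates]) for fn in reward_fns]
    totals = [sum(r[i] for r in per_reward) for i in range(len(candidates))]
    order = sorted(range(len(candidates)), key=lambda i: totals[i])
    return [candidates[i] for i in order[:keep]]


prompt = "a dog playing on the beach"
likelihood = {"dog": 0.9, "sand": 0.8, "sea": 0.8, "cat": 0.1, "sofa": 0.05, "grass": 0.3}
candidates = [
    Candidate("A", "a dog playing on the beach", ["dog", "sand", "sea"]),
    Candidate("B", "a cat sitting on a sofa", ["cat", "sofa"]),
    Candidate("C", "a dog on the grass", ["dog", "grass"]),
]
reward_fns = [partial(caption_reward, prompt), partial(segment_reward, likelihood)]
selected = select_by_combined_rank(candidates, reward_fns, keep=2)
print([c.name for c in selected])
```

In a reward-ranked fine-tuning loop, the selected samples would serve as training data for the next round. Summing ranks rather than raw scores is one way to avoid having to put differently scaled rewards on a common scale, though other aggregation rules are equally plausible.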

