Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh

Training text-to-image models with web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often struggle to generate highly aesthetic images. This creates the need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on $1.1$ billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of $82.9\%$ compared with its pre-trained-only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred on visual appeal $68.4\%$ of the time on the standard PartiPrompts benchmark and $71.3\%$ of the time on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
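The win rates quoted in the abstract are pairwise preference figures. As a rough illustration of how such a number can be computed, here is a minimal sketch. The abstract does not describe the annotation protocol, so everything below is an assumption: the per-prompt majority vote, the treatment of ties (counted in the denominator but not as wins), the function names `majority_label` and `win_rate`, and the toy votes, which are made up and are not data from the paper.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label for one prompt.

    Assumption: if the two most common labels have equal counts,
    the prompt is recorded as a "tie". `labels` must be non-empty.
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]


def win_rate(per_prompt_labels, model="A"):
    """Fraction of prompts whose majority vote prefers `model`.

    Assumption: ties stay in the denominator, so the win rates of
    the two models need not sum to 1.
    """
    outcomes = [majority_label(labels) for labels in per_prompt_labels]
    return sum(outcome == model for outcome in outcomes) / len(outcomes)


# Hypothetical votes: one inner list per prompt, one entry per annotator.
votes = [
    ["A", "A", "B"],    # majority: A
    ["A", "B", "B"],    # majority: B
    ["A", "A", "A"],    # majority: A
    ["A", "B", "tie"],  # no majority: tie
    ["A", "A", "tie"],  # majority: A
]

print(f"Model A win rate: {win_rate(votes, 'A'):.1%}")
print(f"Model B win rate: {win_rate(votes, 'B'):.1%}")
```

Under these assumptions, model A wins three of the five toy prompts and model B wins one. A different tie policy (for example, dropping tied prompts from the denominator) would give different figures, so a reported win rate is only comparable across studies when the protocol is the same.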

