Training text-to-image models on web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often struggle to generate highly aesthetic images, which creates the need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on $1.1$ billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of $82.9\%$ against its pre-trained-only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred on visual appeal $68.4\%$ of the time on the standard PartiPrompts benchmark and $71.3\%$ of the time on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.