Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, the key components of the best-performing LDM training recipes are often unavailable to the research community, preventing apples-to-apples comparisons and hindering the validation of progress in the field. In this work, we perform an in-depth study of LDM training recipes, focusing on model performance and training efficiency. To ensure apples-to-apples comparisons, we re-implement five previously published models with their corresponding recipes. Through our study, we explore the effects of (i)~the mechanisms used to condition the generative model on semantic information (e.g., text prompt) and control metadata (e.g., crop size, random flip flag, etc.) on model performance, and (ii)~the transfer of representations learned on smaller, lower-resolution datasets to larger ones on training efficiency and model performance. We then propose a novel conditioning mechanism that disentangles semantic and control metadata conditionings and sets a new state of the art in class-conditional generation on the ImageNet-1k dataset -- with FID improvements of 7% at 256 and 8% at 512 resolution -- as well as in text-to-image generation on the CC12M dataset -- with FID improvements of 8% at 256 and 23% at 512 resolution.