Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, the key components of the best-performing LDM training recipes are often unavailable to the research community, preventing apples-to-apples comparisons and hindering the validation of progress in the field. In this work, we perform an in-depth study of LDM training recipes, focusing on model performance and training efficiency. To ensure apples-to-apples comparisons, we re-implement five previously published models with their corresponding recipes. Through our study, we explore the effects of (i)~the mechanisms used to condition the generative model on semantic information (e.g., text prompt) and control metadata (e.g., crop size, random flip flag, etc.) on model performance, and (ii)~the transfer of representations learned on smaller, lower-resolution datasets to larger ones on training efficiency and model performance. We then propose a novel conditioning mechanism that disentangles semantic and control metadata conditionings and sets a new state of the art in class-conditional generation on the ImageNet-1k dataset -- with FID improvements of 7% at 256 and 8% at 512 resolution -- as well as in text-to-image generation on the CC12M dataset -- with FID improvements of 8% at 256 and 23% at 512 resolution.