Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
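To make the EMA remark concrete, the following is a minimal sketch of the conventional exponential moving average of network weights, in which the decay is fixed before training starts. It is not the post-hoc method summarized above; it only illustrates why tuning the EMA length would otherwise require several training runs. The names `ema_update`, `run_ema`, `snapshots`, and `beta` are hypothetical and chosen for illustration, and the weights are plain Python lists rather than real network parameters.

```python
def ema_update(avg, params, beta):
    """One conventional EMA step: beta * avg + (1 - beta) * params."""
    return [beta * a + (1.0 - beta) * p for a, p in zip(avg, params)]


def run_ema(snapshots, beta):
    """Fold a sequence of weight snapshots into a running average.

    `snapshots` stands in for the weights after each training step
    (an assumption for illustration). `beta` is fixed up front, so a
    different EMA length means repeating the whole loop.
    """
    avg = list(snapshots[0])  # initialize with the first snapshot
    for params in snapshots[1:]:
        avg = ema_update(avg, params, beta)
    return avg


if __name__ == "__main__":
    toy_snapshots = [[0.0], [1.0], [1.0]]
    # A larger beta gives a longer average that reacts more slowly.
    print(run_ema(toy_snapshots, beta=0.5))
    print(run_ema(toy_snapshots, beta=0.9))
```

In this conventional form, comparing two values of `beta` means tracking two averages during training or repeating the run, which is the cost the post-hoc approach is stated to avoid.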