Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
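To make the EMA point concrete, the following is a minimal sketch of a conventional exponential moving average of network weights, computed step by step during training. It is not the post-hoc method of this paper, which the abstract does not specify; the names `ema_update`, `run_ema`, `decay`, and the toy snapshot values are all assumptions introduced for illustration. The sketch only shows why the averaging length normally has to be chosen before training: the decay is baked into every update, so comparing two lengths conventionally means maintaining two separate averages or running training again.

```python
def ema_update(ema_weights, weights, decay):
    """Return the average after one step: decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def run_ema(weight_snapshots, decay):
    """Fold a sequence of per-step weight snapshots into one averaged weight vector."""
    ema = list(weight_snapshots[0])  # initialize from the first snapshot
    for weights in weight_snapshots[1:]:
        ema = ema_update(ema, weights, decay)
    return ema


# Hypothetical toy "training run": one scalar weight that jumps from 0.0 to 1.0.
snapshots = [[0.0], [1.0], [1.0]]

# A smaller decay gives a shorter average that tracks recent weights closely;
# a larger decay gives a longer average that lags behind them.
short = run_ema(snapshots, decay=0.5)
long = run_ema(snapshots, decay=0.9)
print(short, long)
```

In this toy run the shorter average ends closer to the final weight than the longer one. With a real network, the best length is not known in advance, which is the tuning cost the post-hoc approach is meant to remove.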