As highly expressive generative models, diffusion models have demonstrated exceptional success across various domains, including image generation, natural language processing, and combinatorial optimization. However, as data distributions grow more complex, training these models to convergence becomes increasingly computationally intensive. While diffusion models are typically trained using uniform timestep sampling, our research shows that the variance of the stochastic gradients differs significantly across timesteps, with high-variance timesteps becoming bottlenecks that hinder faster convergence. To address this issue, we introduce a non-uniform timestep sampling method that prioritizes these more critical timesteps. Our method tracks the impact of gradient updates on the objective for each timestep, adaptively selecting those most likely to minimize the objective effectively. Experimental results demonstrate that this approach not only accelerates the training process but also leads to improved performance at convergence. Furthermore, our method shows robust performance across various datasets, scheduling strategies, and diffusion architectures, outperforming previously proposed timestep sampling and weighting heuristics that lack this degree of robustness.
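The adaptive sampling idea above can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the class name, the exponential-moving-average impact estimate, the softmax temperature, and the `loss_before`/`loss_after` bookkeeping are all assumptions introduced here for clarity.

```python
import numpy as np

class AdaptiveTimestepSampler:
    """Hypothetical sketch of non-uniform diffusion timestep sampling.

    Tracks a per-timestep 'impact' score (how much recent gradient updates
    at that timestep reduced the objective) and samples timesteps with
    probability proportional to a softmax over those scores.
    """

    def __init__(self, num_timesteps, ema_decay=0.9, temperature=1.0):
        self.num_timesteps = num_timesteps
        self.ema_decay = ema_decay        # smoothing for the impact estimate
        self.temperature = temperature    # lower -> more peaked sampling
        # Exponential moving average of per-timestep objective decrease.
        self.impact = np.zeros(num_timesteps)

    def sample(self, batch_size, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Softmax over tracked impact: high-impact timesteps are drawn more often.
        logits = self.impact / self.temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return rng.choice(self.num_timesteps, size=batch_size, p=probs)

    def update(self, t, loss_before, loss_after):
        # Record how much the gradient update at timestep t reduced the loss.
        decrease = loss_before - loss_after
        self.impact[t] = (self.ema_decay * self.impact[t]
                          + (1.0 - self.ema_decay) * decrease)
```

With a uniform (zero) impact vector this reduces to uniform sampling; as some timesteps repeatedly yield larger loss decreases, sampling concentrates on them, mimicking the prioritization the abstract describes.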