Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

Diffusion models, emerging as powerful deep generative tools, excel in various applications. They operate through a two-step process: introducing noise into training samples, and then employing a model to convert random noise into new samples (e.g., images). However, their remarkable generative performance is hindered by slow training and sampling. This is due to the necessity of tracking extensive forward and reverse diffusion trajectories, and of employing a large model with numerous parameters across multiple timesteps (i.e., noise levels). To tackle these challenges, we present a multi-stage framework inspired by our empirical findings. These findings indicate the advantages of employing distinct parameters tailored to each timestep while retaining universal parameters shared across all timesteps. Our approach segments the time interval into multiple stages, where we employ a custom multi-decoder U-Net architecture that blends time-dependent models with a universally shared encoder. Our framework enables the efficient distribution of computational resources and mitigates inter-stage interference, which substantially improves training efficiency. Extensive numerical experiments affirm the effectiveness of our framework, showing significant improvements in training and sampling efficiency on three state-of-the-art diffusion models, including large-scale latent diffusion models. Furthermore, our ablation studies illustrate the impact of two important components of our framework: (i) a novel timestep clustering algorithm for stage division, and (ii) an innovative multi-decoder U-Net architecture that seamlessly integrates universal and customized hyperparameters.
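The routing idea in the abstract (one encoder shared across all timesteps, one decoder per stage of the time interval) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `MultiStageDenoiser`, the toy encoder and decoders, and the hand-picked stage boundaries are hypothetical, and the boundaries stand in for the output of the timestep clustering algorithm, which the abstract does not specify. Plain Python callables replace the U-Net modules.

```python
from bisect import bisect_right


class MultiStageDenoiser:
    """Toy shared-encoder, per-stage-decoder router (all names are hypothetical)."""

    def __init__(self, encoder, decoders, boundaries):
        # boundaries are interior split points of [0, 1];
        # n boundaries define n + 1 stages.
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted")
        if len(decoders) != len(boundaries) + 1:
            raise ValueError("need exactly one decoder per stage")
        self.encoder = encoder
        self.decoders = list(decoders)
        self.boundaries = list(boundaries)

    def stage_of(self, t):
        """Return the index of the stage that contains timestep t in [0, 1]."""
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        return bisect_right(self.boundaries, t)

    def __call__(self, x, t):
        features = self.encoder(x, t)  # parameters shared by all timesteps
        decoder = self.decoders[self.stage_of(t)]  # parameters of one stage only
        return decoder(features, t)


# Stand-in callables; a real model would use neural network modules here.
def toy_encoder(x, t):
    return [2.0 * v for v in x]


def make_toy_decoder(offset):
    return lambda features, t: [v + offset for v in features]


model = MultiStageDenoiser(
    encoder=toy_encoder,
    decoders=[make_toy_decoder(0.0), make_toy_decoder(10.0), make_toy_decoder(100.0)],
    boundaries=[0.3, 0.7],  # hand-picked for illustration, not taken from the paper
)
```

Calling `model(x, t)` with `t` in the middle stage uses only the second decoder. In a trained version of this layout, gradients from one stage would update the shared encoder and that stage's decoder, leaving the other decoders untouched, which is one plausible reading of how a design like this limits inter-stage interference.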
