Diffusion models have shown strong potential for generating diverse images. However, generation is slow because sampling requires iterative denoising. Knowledge distillation has recently been proposed as a remedy that can reduce the number of inference steps to one or a few without significant quality degradation. However, existing distillation methods either require significant amounts of offline computation to generate synthetic training data from the teacher model, or need to perform expensive online learning with the help of real data. In this work, we present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm. The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion model teacher given any time step. Such a model can be trained efficiently by bootstrapping from two consecutive sampled steps. Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models, which are challenging for conventional methods because their training sets are often large and difficult to access. We demonstrate the effectiveness of our approach on several benchmark datasets in the DDIM setting, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher. The text-to-image results show that the proposed approach is able to handle highly complex distributions, shedding light on more efficient generative modeling.
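To make the bootstrapping idea concrete, the following is a minimal sketch in plain Python, not the paper's actual objective. It assumes a one-dimensional toy problem, a cosine noise schedule, and a hypothetical teacher whose data distribution is a single point `MU`, so that its noise prediction is known in closed form. All names (`alpha_sigma`, `ddim_step`, `bootstrap_loss`, `exact_student`) are illustrative. The student takes a noise sample and a time step, and its output at the earlier step `s` is regressed onto one deterministic DDIM step of the teacher, started from the student's own output at the adjacent step `t`. No real data is involved.

```python
import math

MU = 2.0  # assumed toy data distribution: a single point at MU


def alpha_sigma(t):
    """Assumed cosine schedule on t in [0, 1] with alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def teacher_eps(x_t, t):
    """Closed-form noise prediction of the toy teacher (requires t > 0)."""
    a_t, s_t = alpha_sigma(t)
    return (x_t - a_t * MU) / s_t


def ddim_step(eps_model, x_t, t, s):
    """One deterministic DDIM update from time t to an earlier time s."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    eps_hat = eps_model(x_t, t)
    x0_hat = (x_t - s_t * eps_hat) / a_t  # teacher's estimate of the clean sample
    return a_s * x0_hat + s_s * eps_hat


def bootstrap_loss(student, eps_model, eps, t, s):
    """Squared error between the student at s and a bootstrapped target.

    The target is one teacher DDIM step applied to the student's own
    prediction at t. In a real training loop the target would be treated
    as a constant (stop-gradient); here we only evaluate the loss.
    """
    target = ddim_step(eps_model, student(eps, t), t, s)
    return (student(eps, s) - target) ** 2


def exact_student(eps, t):
    """The teacher's true DDIM trajectory for this toy problem."""
    a_t, s_t = alpha_sigma(t)
    return a_t * MU + s_t * eps


def naive_student(eps, t):
    """A deliberately wrong student that ignores the time step."""
    return eps


loss_exact = bootstrap_loss(exact_student, teacher_eps, 0.7, 0.6, 0.5)
loss_naive = bootstrap_loss(naive_student, teacher_eps, 0.7, 0.6, 0.5)
```

In this toy setting the true trajectory is a fixed point of the bootstrapped objective, so `loss_exact` is zero up to floating-point error, while the time-agnostic student incurs a positive loss. A practical version would replace the closed-form teacher with a pre-trained network and minimize the loss over randomly sampled noise and time steps.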