Diffusion models achieve remarkable quality in image generation, but at a cost: iterative denoising requires many time steps to produce high-fidelity images. We argue that the denoising process is crucially limited by an accumulation of reconstruction error that stems from an inaccurate initial reconstruction of the target data. This leads to lower-quality outputs and slower convergence. To address this issue, we propose compensation sampling to guide the generation towards the target domain. We introduce a compensation term, implemented as a U-Net, which adds negligible computational overhead during training and, optionally, inference. Our approach is flexible, and we demonstrate its application in unconditional generation, face inpainting, and face de-occlusion on the benchmark datasets CIFAR-10, CelebA, CelebA-HQ, FFHQ-256, and FSG. Our approach consistently yields state-of-the-art image quality while accelerating convergence of the denoising process during training by up to an order of magnitude.