Diffusion inversion is the task of recovering the noise that corresponds to an image in a diffusion model, which is vital for controllable diffusion-based image editing. Diffusion inversion remains challenging due to the lack of viable supervision signals. Most existing methods therefore resort to approximation-based solutions, which often come at the cost of performance or efficiency. To remedy these shortcomings, we propose a novel self-supervised diffusion inversion approach, termed Deep Inversion (DeepInv). Instead of requiring ground-truth noise annotations, we introduce a self-supervised objective and a data augmentation strategy that generate high-quality pseudo noise from real images without manual intervention. Building on these two designs, DeepInv uses an iterative, multi-scale training regime to train a parameterized inversion solver, achieving fast and accurate image-to-noise mapping. To the best of our knowledge, this is the first attempt to present a trainable solver that predicts inversion noise step by step. Extensive experiments show that DeepInv achieves much better performance and inference speed than the compared methods, e.g., +40.435% SSIM over EasyInv and +9887.5% speed over ReNoise on the COCO dataset. Moreover, the design of our trainable solver may offer insights to the community. Code and model parameters will be released at https://github.com/potato-kitty/DeepInv.