Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large number of image-annotation pairs for training and is thus dependent on their availability, correctness and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By combining a feature extraction function with a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels, especially on unbalanced data. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible and expected to profit from deployment at scale. Source code will be available at: https://taohu.me/sgdm/
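As a rough illustration of the image-level case, the pairing of a feature extraction function with a self-annotation function can be sketched as follows. This is a minimal sketch under stated assumptions, not the released implementation: the features are taken as given (in practice they would come from some self-supervised encoder), k-means clustering is assumed as the self-annotation function, and the guided noise estimate is assumed to follow the usual classifier-free guidance combination. The names `self_annotate` and `guided_noise` are hypothetical.

```python
import numpy as np


def self_annotate(features, k, iters=10, seed=0):
    """Assign a pseudo-label to each feature vector via k-means.

    Assumption: clustering stands in for the self-annotation function;
    `features` is a float array of shape (n_samples, dim).
    """
    rng = np.random.default_rng(seed)
    # Farthest-point initialisation: start from one random sample, then
    # repeatedly add the sample farthest from all centroids chosen so far.
    centroids = [features[rng.integers(len(features))]]
    while len(centroids) < k:
        chosen = np.array(centroids)
        d = np.linalg.norm(features[:, None] - chosen[None], axis=-1).min(axis=1)
        centroids.append(features[d.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(iters):
        # Assign every sample to its nearest centroid.
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's centroid to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return labels


def guided_noise(eps_uncond, eps_cond, w):
    """Combine unconditional and pseudo-label-conditioned noise predictions.

    Assumption: classifier-free guidance form, with guidance strength w >= 0;
    w = 0 returns the conditional prediction unchanged.
    """
    return (1.0 + w) * eps_cond - w * eps_uncond


# Toy usage: two well-separated groups of stand-in "features".
features = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
pseudo_labels = self_annotate(features, k=2)
```

In this sketch the pseudo-labels would replace ground-truth class labels as the conditioning signal during training and sampling; the box-level and mask-level variants would need a different self-annotation function and are not covered here.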