Pre-trained models that learn from large data corpora have made impressive progress. As a popular form of generative pre-training, diffusion models capture both low-level visual knowledge and high-level semantic relations. In this paper, we propose to exploit such knowledge-rich diffusion models for mainstream discriminative tasks, specifically unsupervised object discovery in the form of saliency segmentation and object localization. Two challenges arise. First, a structural difference between generative and discriminative models limits direct use. Second, the lack of explicitly labeled data significantly limits performance in unsupervised settings. To tackle these issues, we introduce DiffusionSeg, a novel two-stage synthesis-exploitation framework. In the first stage, synthesis, we alleviate data insufficiency by synthesizing abundant images and propose a novel training-free method, AttentionCut, to obtain their masks. In the second stage, exploitation, we bridge the structural gap by using an inversion technique to map a given image back to diffusion features, which can be used directly by downstream architectures. Extensive experiments and ablation studies demonstrate the superiority of adapting diffusion models for unsupervised object discovery.