We introduce SODA, a self-supervised diffusion model designed for representation learning. The model incorporates an image encoder that distills a source view into a compact representation, which in turn guides the generation of related novel views. We show that by imposing a tight bottleneck between the encoder and a denoising decoder, and by leveraging novel view synthesis as a self-supervised objective, we can turn diffusion models into strong representation learners, capable of capturing visual semantics in an unsupervised manner. To the best of our knowledge, SODA is the first diffusion model to succeed at ImageNet linear-probe classification, and, at the same time, it accomplishes reconstruction, editing and synthesis tasks across a wide range of datasets. Further investigation reveals the disentangled nature of its emergent latent space, which serves as an effective interface for controlling and manipulating the images the model produces. All in all, we aim to shed light on the promising potential of diffusion models, not only for image generation, but also for learning rich and robust representations.
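To make the encoder-bottleneck-decoder arrangement concrete, the following is a minimal toy sketch, not the SODA implementation. It assumes a standard noise-prediction (DDPM-style) loss, which the abstract does not specify, and uses linear maps in place of real networks. All names (`encode`, `predict_noise`, `denoising_loss`, `IMG_DIM`, `LATENT_DIM`) are hypothetical, and the two "views" are simply perturbed copies of one random vector.

```python
import math
import random

IMG_DIM = 16     # flattened toy "image" size (assumption)
LATENT_DIM = 2   # tight bottleneck, far smaller than IMG_DIM (assumption)


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(x, w):
    """Multiply row vector x (length = rows of w) by matrix w."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def encode(source_view, w_enc):
    """Hypothetical linear encoder: source view -> compact latent z."""
    return matvec(source_view, w_enc)


def predict_noise(noisy_target, z, alpha_bar, w_dec):
    """Hypothetical linear denoiser conditioned on z and the noise level."""
    features = noisy_target + z + [alpha_bar]
    return matvec(features, w_dec)


def denoising_loss(source_view, target_view, alpha_bar, w_enc, w_dec, rng):
    """Noise-prediction MSE on the target view, guided by the source latent."""
    eps = [rng.gauss(0.0, 1.0) for _ in target_view]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [a * x + b * e for x, e in zip(target_view, eps)]
    z = encode(source_view, w_enc)
    eps_hat = predict_noise(noisy, z, alpha_bar, w_dec)
    loss = sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(eps)
    return loss, z


rng = random.Random(0)
w_enc = make_matrix(IMG_DIM, LATENT_DIM, rng)
w_dec = make_matrix(IMG_DIM + LATENT_DIM + 1, IMG_DIM, rng)

# Toy "views": two perturbed copies of the same underlying signal.
base = [rng.gauss(0.0, 1.0) for _ in range(IMG_DIM)]
source = [x + rng.gauss(0.0, 0.05) for x in base]
target = [x + rng.gauss(0.0, 0.05) for x in base]

loss, z = denoising_loss(source, target, 0.5, w_enc, w_dec, rng)
print(f"latent size: {len(z)}, loss: {loss:.4f}")
```

In this sketch the denoiser sees the source view only through `z`, so any information it uses about the target's content must pass through the low-dimensional latent. This is the bottleneck idea in its simplest form.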