Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, we investigate whether diffusion models can acquire discriminative representations for classification via generative pre-training. This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAEs), are unified self-supervised learners: by pre-training on unconditional image generation, a DDAE has already learned strongly linearly separable representations in its intermediate layers without auxiliary encoders, which makes diffusion pre-training a general approach for both generative and discriminative self-supervised learning. To verify this, we perform linear probe and fine-tuning evaluations on multi-class datasets. Our diffusion-based approach achieves 95.9% and 50.0% linear probe accuracies on CIFAR-10 and Tiny-ImageNet, respectively, and is for the first time comparable to masked autoencoders and contrastive learning. Additionally, transfer learning from ImageNet confirms the suitability of DDAEs for latent-space Vision Transformers, suggesting the potential for scaling DDAEs as unified foundation models.
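The linear probe evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the probing procedure, so everything below is an assumption made for illustration: the input is noised at a single fixed level (`alpha_bar`), features come from a frozen map, and only a softmax classifier is trained on top. The function `frozen_features` is a hypothetical toy stand-in, not a real DDAE layer, and the two Gaussian blobs stand in for image classes.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward diffusion step: x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * v + s * rng.gauss(0.0, 1.0) for v in x0]


def frozen_features(x_t):
    """Hypothetical stand-in for a frozen intermediate layer of the denoiser.

    A fixed map [relu(x), relu(-x)]; a real DDAE would instead return pooled
    activations from a chosen block of the pre-trained network.
    """
    return [max(0.0, v) for v in x_t] + [max(0.0, -v) for v in x_t]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, f):
    logits = [sum(w * v for w, v in zip(row, f)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def train_linear_probe(feats, labels, n_classes, lr=0.1, epochs=50):
    """Fit softmax regression on frozen features; only W and b are updated."""
    dim = len(feats[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    n = len(feats)
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for f, y in zip(feats, labels):
            p = predict(W, b, f)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)  # cross-entropy gradient
                gb[k] += err / n
                for j in range(dim):
                    gW[k][j] += err * f[j] / n
        for k in range(n_classes):
            b[k] -= lr * gb[k]
            for j in range(dim):
                W[k][j] -= lr * gW[k][j]
    return W, b


def accuracy(W, b, feats, labels):
    hits = 0
    for f, y in zip(feats, labels):
        p = predict(W, b, f)
        hits += int(p.index(max(p)) == y)
    return hits / len(labels)


def make_toy_data(n_per_class, rng, dim=4):
    """Two Gaussian blobs standing in for two image classes (toy data)."""
    xs, ys = [], []
    for label, mean in ((0, 1.5), (1, -1.5)):
        for _ in range(n_per_class):
            xs.append([rng.gauss(mean, 0.5) for _ in range(dim)])
            ys.append(label)
    return xs, ys


def run_probe(alpha_bar=0.9, seed=0):
    rng = random.Random(seed)
    train_x, train_y = make_toy_data(100, rng)
    test_x, test_y = make_toy_data(50, rng)
    # Features are extracted from noised inputs at one fixed noise level.
    train_f = [frozen_features(add_noise(x, alpha_bar, rng)) for x in train_x]
    test_f = [frozen_features(add_noise(x, alpha_bar, rng)) for x in test_x]
    W, b = train_linear_probe(train_f, train_y, n_classes=2)
    return accuracy(W, b, test_f, test_y)
```

Calling `run_probe()` returns the held-out accuracy of the probe on the toy data. The point of the sketch is the separation of roles: the feature extractor is never updated, so the probe accuracy reflects how linearly separable the frozen features already are. With a real pre-trained network, the layer and the noise level at which features are read out would be choices to tune, which this sketch does not attempt.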