While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model that uses a single pre-training stage to address both families of tasks simultaneously. We identify diffusion models as a prime candidate. Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, and manipulation. Such models train a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. As a convolution-based architecture, the U-Net produces a diverse set of feature representations in the form of intermediate feature maps. We find that these embeddings are useful beyond the noise prediction task: they contain discriminative information and can also be leveraged for classification. We explore optimal methods for extracting and using these embeddings for classification, and demonstrate promising results on the ImageNet classification task. With careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN on classification. We also investigate diffusion models in the transfer learning regime, examining their performance on several fine-grained visual classification datasets, and we compare these embeddings to those generated by competing architectures and pre-training methods for classification tasks.
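To make the "feature selection and pooling" step concrete, the following is a minimal sketch of how one intermediate feature map could be pooled into a fixed-length embedding and passed to a linear classifier. It is an illustration under stated assumptions, not the method used in the paper: the function names (`adaptive_avg_pool`, `embed`, `linear_probe_logits`), the pooling size, and the placeholder feature map are all hypothetical, and the abstract does not specify which U-Net block, noise level, or pooling configuration works best.

```python
import numpy as np


def _ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def adaptive_avg_pool(fmap, out_size):
    """Average-pool a (C, H, W) feature map down to (C, out_size, out_size).

    Bin edges follow the common adaptive-pooling rule:
    start = floor(i * H / out_size), end = ceil((i + 1) * H / out_size).
    """
    c, h, w = fmap.shape
    pooled = np.empty((c, out_size, out_size), dtype=np.float64)
    for i in range(out_size):
        r0, r1 = (i * h) // out_size, _ceil_div((i + 1) * h, out_size)
        for j in range(out_size):
            c0, c1 = (j * w) // out_size, _ceil_div((j + 1) * w, out_size)
            pooled[:, i, j] = fmap[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return pooled


def embed(fmap, out_size=2):
    """Pool a feature map and flatten it into a 1-D embedding vector."""
    return adaptive_avg_pool(fmap, out_size).reshape(-1)


def linear_probe_logits(embedding, weight, bias):
    """Class scores from a linear classifier applied to a frozen embedding."""
    return weight @ embedding + bias


# Placeholder standing in for one intermediate U-Net feature map (assumption:
# in practice this would come from a chosen block of a pre-trained model).
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 16, 16))

embedding = embed(feature_map, out_size=2)        # length 8 * 2 * 2 = 32
weight = rng.standard_normal((10, embedding.size))  # 10 hypothetical classes
bias = np.zeros(10)
predicted_class = int(np.argmax(linear_probe_logits(embedding, weight, bias)))
```

A smaller pooling size gives a shorter embedding at the cost of spatial detail, and in a real setup only `weight` and `bias` would be trained while the diffusion backbone stays frozen.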