In this work, we introduce DreamTeacher, a self-supervised feature representation learning framework that uses generative networks to pre-train downstream image backbones. We propose to distill knowledge from a trained generative model into standard image backbones that have been well engineered for specific perception tasks. We investigate two types of knowledge distillation: 1) distilling learned generative features onto target image backbones, as an alternative to pre-training these backbones on large labeled datasets such as ImageNet, and 2) distilling labels obtained from generative networks with task heads onto the logits of target backbones. We perform extensive analyses on multiple generative models, dense prediction benchmarks, and several pre-training regimes. We empirically find that DreamTeacher significantly outperforms existing self-supervised representation learning approaches across the board. Unsupervised ImageNet pre-training with DreamTeacher leads to significant improvements over ImageNet classification pre-training on downstream datasets, showcasing generative models, and diffusion generative models specifically, as a promising approach to representation learning on large, diverse datasets without requiring manual annotation.
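As a rough illustration of the two distillation types named above, the following minimal NumPy sketch shows one generic way each objective could be written. It is not the authors' implementation: the function names, the array shapes, the choice of a mean-squared-error feature loss on L2-normalized features, and the soft-label cross-entropy with a `temperature` parameter are all assumptions made for the example.

```python
import numpy as np


def feature_distillation_loss(student_feats, teacher_feats, eps=1e-8):
    """MSE between L2-normalized student and teacher features.

    Both arrays are assumed to have shape (batch, pixels, channels),
    with the teacher features coming from a frozen generative model.
    """
    def l2_normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

    diff = l2_normalize(student_feats) - l2_normalize(teacher_feats)
    return float(np.mean(diff ** 2))


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def label_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between teacher soft labels and student predictions.

    Both arrays are assumed to have shape (batch, pixels, classes); the
    teacher logits stand in for the output of a task head attached to
    the generative network.
    """
    teacher_probs = np.exp(log_softmax(teacher_logits / temperature))
    student_log_probs = log_softmax(student_logits / temperature)
    return float(-np.mean(np.sum(teacher_probs * student_log_probs, axis=-1)))
```

In a setup like this, the feature loss would be used for pre-training without labels, while the label loss only applies once a task head on the generative model can produce soft labels.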