We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We consider specifically Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass those learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
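To make point (2) concrete, the following is a minimal sketch of one way a multi-positive contrastive loss could be written. It is an illustration under stated assumptions, not the StableRep implementation: the function name `multi_positive_contrastive_loss`, the `prompt_ids` argument, and the default temperature are hypothetical, and the loss form (cross-entropy between a uniform target over same-prompt images and a softmax over cosine similarities) is one plausible reading of "treating images from the same prompt as positives for each other". Pure Python is used so the example has no dependencies.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Hypothetical multi-positive contrastive loss (illustrative sketch).

    embeddings: list of feature vectors, one per generated image.
    prompt_ids: list of labels; images sharing a label came from the
                same text prompt and are treated as mutual positives.
    Returns the mean, over anchors with at least one positive, of the
    cross-entropy between a uniform target over the anchor's positives
    and the softmax over its similarities to all other images.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other image.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in logits if prompt_ids[j] == prompt_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Uniform target: average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    if counted == 0:
        raise ValueError("no anchor has a positive; need >= 2 images per prompt")
    return total / counted


# Toy usage: four 2-D "embeddings", two images per prompt.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = multi_positive_contrastive_loss(features, [0, 0, 1, 1])
loss_mismatched = multi_positive_contrastive_loss(features, [0, 1, 0, 1])
```

In this toy case the grouping `[0, 0, 1, 1]` pairs embeddings that already point in similar directions, so its loss is lower than for the mismatched grouping `[0, 1, 0, 1]`. With exactly one positive per anchor, the expression reduces to the familiar single-positive InfoNCE form.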