We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.
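The idea of treating images generated from the same caption as positive pairs can be illustrated with a small multi-positive contrastive loss. The following is a minimal NumPy sketch under stated assumptions, not necessarily the exact objective used by SynCLR: the function name `multi_positive_contrastive_loss`, the `temperature` default, and the choice of a uniform target distribution over positives are all illustrative and do not come from the text above.

```python
import numpy as np


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Contrastive loss where images sharing a caption id are positives.

    Hypothetical helper for illustration only.
    embeddings:  (n, d) array of image features.
    caption_ids: (n,) array; equal ids mean "generated from the same caption".
    """
    z = np.asarray(embeddings, dtype=np.float64)
    ids = np.asarray(caption_ids)

    # Cosine similarity between all pairs of images, scaled by temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # An image is never its own positive or negative: mask the diagonal.
    n = len(ids)
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over all *other* images in the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: other images that share the anchor's caption.
    positives = (ids[:, None] == ids[None, :]) & ~self_mask
    n_pos = positives.sum(axis=1)
    if np.any(n_pos == 0):
        raise ValueError("every caption needs at least two images in the batch")

    # Assumed target: uniform distribution over each anchor's positives.
    target = positives / n_pos[:, None]

    # Cross-entropy between the target and the predicted distribution.
    # np.where keeps the masked diagonal (-inf) out of the product.
    per_anchor = -(np.where(positives, log_prob, 0.0) * target).sum(axis=1)
    return per_anchor.mean()


if __name__ == "__main__":
    # Toy batch: two captions, two synthetic images each.
    feats = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
    print(multi_positive_contrastive_loss(feats, [0, 0, 1, 1]))
```

In this toy batch, the loss is lower when the caption ids match the geometry of the features (images from the same caption lie close together) than when the ids are shuffled, which is the signal the encoder would be trained on.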