This paper introduces DiffMix, a new self-supervised learning (SSL) pre-training framework that combines real and synthetic images. Unlike traditional SSL methods, which predominantly use real images, DiffMix uses a variant of Stable Diffusion to replace one augmented view of a real image with a synthetic counterpart, encouraging the model to learn representations that bridge real and synthetic images. The key insight is that while SSL methods trained solely on synthetic images underperform those trained on real images, a blended training approach using both real and synthetic images yields more robust and adaptable representations. Experiments demonstrate that DiffMix improves the SSL methods SimCLR, BarlowTwins, and DINO across various robustness benchmarks and domain transfer tasks, boosting SimCLR's accuracy on ImageNet-1K by 4.56\%. These results challenge the notion that high-quality real images are essential for SSL pre-training by showing that lower-quality synthetic images can also produce strong representations. DiffMix also reduces SSL's reliance on image augmentations, suggesting new optimization strategies.
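The core pairing strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `augment` and `synthetic_counterpart` are hypothetical stand-ins (simulated here with noise) for a standard SSL augmentation pipeline and a diffusion-generated image of the same instance, and the mixing probability `p_synth` is an assumed hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    # Stand-in for a standard SSL augmentation (crop, color jitter, etc.)
    return x + rng.normal(0.0, 0.1, size=x.shape)

def synthetic_counterpart(x):
    # Stand-in for a synthetic image of the same instance; in DiffMix
    # this would come from a Stable Diffusion variant, not noise.
    return x + rng.normal(0.0, 0.3, size=x.shape)

def make_pair(x, p_synth=0.5):
    """Build a positive pair for SSL: one augmented real view, plus either
    a second augmented real view (classic SSL) or a synthetic view
    replacing it (the DiffMix idea)."""
    view_a = augment(x)
    if rng.random() < p_synth:
        view_b = augment(synthetic_counterpart(x))  # real-synthetic pair
    else:
        view_b = augment(x)                         # real-real pair
    return view_a, view_b

image = rng.normal(size=(3, 32, 32))  # toy "real image" tensor
a, b = make_pair(image)
print(a.shape, b.shape)
```

The resulting pairs feed into any two-view SSL objective (e.g. SimCLR's contrastive loss) unchanged; only the source of the second view differs.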