Rapid advances in self-supervised representation learning have highlighted its potential to leverage unlabeled data for learning rich visual representations. However, existing techniques, particularly those that contrast different augmentations of the same image, often rely on a limited set of simple transformations that cannot fully capture real-world variation. This constrains the diversity and quality of training views, leading to sub-optimal representations. In this paper, we introduce a framework that enriches the self-supervised learning (SSL) paradigm by utilizing generative models to produce semantically consistent image augmentations. By conditioning a generative model directly on a source image, our method generates diverse augmentations while preserving the semantics of that image, offering a richer set of data for SSL. Extensive experiments across a range of joint-embedding SSL techniques show that our framework improves the quality of learned visual representations by up to 10\% in Top-1 accuracy on downstream tasks. This work demonstrates that incorporating generative models into the joint-embedding SSL workflow opens new avenues for exploiting synthetic data, paving the way for more robust and versatile representation learning techniques.
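As a rough illustration of the training loop the abstract describes, the sketch below pairs a conventionally augmented view with a generated view of the same source image in a SimCLR-style InfoNCE objective. Everything here is an assumption made for illustration, not the paper's implementation: `generative_augment` is a hypothetical placeholder for an image-conditioned generator (e.g., an image-variation diffusion model), the encoder is a toy linear model, and the loss is standard InfoNCE rather than the paper's exact objective.

```python
# Minimal sketch (not the paper's code): one joint-embedding SSL step where the
# second view comes from a generative model conditioned on the source image.
import torch
import torch.nn.functional as F

def generative_augment(images: torch.Tensor) -> torch.Tensor:
    """Hypothetical placeholder: return a semantically consistent variation of
    each source image, e.g., sampled from an image-conditioned generative
    model. Identity-like here so the sketch runs end to end."""
    return images.clone()

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Standard InfoNCE loss where z1[i] and z2[i] embed views of the same image."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy encoder and a stand-in batch of source images.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
images = torch.randn(16, 3, 32, 32)

view1 = images + 0.05 * torch.randn_like(images)  # stand-in for crop/color jitter
view2 = generative_augment(images)                # generated, semantics-preserving view
loss = info_nce(encoder(view1), encoder(view2))
loss.backward()
print(f"loss: {loss.item():.4f}")
```

In this setup the generative view simply replaces (or supplements) one branch of the usual two-view pipeline, so the same sketch applies to other joint-embedding objectives by swapping out the loss.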