Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These generated high-quality images have been successfully applied to enhance contrastive representation learning, a technique termed ``data inflation''. However, we find that generated data, even from a good diffusion model like DDPM, may sometimes harm contrastive learning. We investigate the causes behind this failure from the perspective of both data inflation and data augmentation. For the first time, we reveal their complementary roles: stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena by deriving generalization bounds for contrastive learning under data inflation. Drawing on these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy that introduces no extra computation cost. On benchmark datasets, AdaInf brings significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods. Code is available at https://github.com/PKU-ML/adainf.