Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data. As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations, since this enables efficient SSL by reducing the volume of data required. Nevertheless, quantifying the value of examples for SSL has remained an open question. In this work, we address this question for the first time by proving that the examples that contribute the most to contrastive SSL are those whose augmentations are, in expectation, most similar to the augmentations of other examples. We provide rigorous guarantees for the generalization performance of SSL on such subsets. Empirically, we find, perhaps surprisingly, that the subsets that contribute the most to SSL are those that contribute the least to supervised learning. Through extensive experiments, we show that our subsets outperform random subsets by more than 3% on CIFAR100, CIFAR10, and STL10. Interestingly, we also find that we can safely exclude 20% of examples from CIFAR100 and 40% from STL10 without affecting downstream task performance.
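To make the selection criterion concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each example's augmented views have already been embedded by some encoder, and that similarity is measured by cosine similarity in that embedding space; both are assumptions not stated in the abstract. The function names `expected_augmentation_similarity` and `select_subset` are hypothetical and chosen for illustration only.

```python
import numpy as np

def expected_augmentation_similarity(views):
    """Score each example by the mean cosine similarity between its
    augmented views and the augmented views of all other examples.

    views: array of shape (n_examples, n_augs, dim) holding embedded,
           non-zero augmented views (assumed input format).
    Returns an array of shape (n_examples,).
    """
    views = np.asarray(views, dtype=float)
    n = views.shape[0]
    # Normalize every view so that dot products are cosine similarities.
    unit = views / np.linalg.norm(views, axis=-1, keepdims=True)
    # The mean over all view pairs (a, b) of u_ia . u_jb equals the dot
    # product of the per-example mean views, so average the views first.
    mean_view = unit.mean(axis=1)        # (n_examples, dim)
    sim = mean_view @ mean_view.T        # (n_examples, n_examples)
    np.fill_diagonal(sim, 0.0)           # ignore similarity to itself
    return sim.sum(axis=1) / (n - 1)

def select_subset(views, fraction):
    """Return indices of the top `fraction` of examples by score."""
    scores = expected_augmentation_similarity(views)
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(-scores)[:k]

# Toy data: examples 0-2 have views pointing in similar directions,
# example 3 points elsewhere and should score lowest.
toy_views = np.array([
    [[1.0, 0.0], [1.0, 0.1]],
    [[1.0, 0.05], [1.0, -0.05]],
    [[0.9, 0.1], [1.0, 0.0]],
    [[0.0, 1.0], [0.1, 1.0]],
])
toy_scores = expected_augmentation_similarity(toy_views)
toy_selected = select_subset(toy_views, 0.5)
```

On the toy data, the example whose views are unlike everyone else's receives the lowest score and is left out of the selected half. In practice the embedding used to compare augmentations would itself have to come from some model, which this sketch takes as given.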