Self-supervised learning has recently attracted great attention because it requires only unlabeled data for model training. Contrastive learning is a popular self-supervised method and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound on the downstream classification error rate based on this measure. The bound reveals that the generalization ability of contrastive self-supervised learning is related to three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors are properties of the learned representations, while the third is determined by the pre-defined data augmentation. We further investigate two canonical contrastive losses, InfoNCE and cross-correlation, to show how they provably achieve the first two factors. Moreover, we conduct experiments to study the third factor, and observe a strong correlation between downstream performance and the concentration of augmented data.
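As a rough illustration of the first two factors, the sketch below computes an InfoNCE loss in one commonly used form, together with simple proxies for alignment of positive samples and divergence of class centers. The function names (`info_nce`, `alignment`, `center_divergence`), the temperature value, and the specific proxy definitions are assumptions made for this example; they are not necessarily the exact quantities analyzed in the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE in a common form (assumed here, not taken from the paper).

    z1[i] and z2[i] are embeddings of two augmented views of sample i;
    every z2[j] with j != i serves as a negative for z1[i].
    """
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    logits = z1 @ z2.T / temperature                      # (n, n) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # positives on the diagonal

def alignment(z1, z2):
    """Proxy for alignment: mean distance between positive pairs (lower is better)."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    return np.mean(np.linalg.norm(z1 - z2, axis=1))

def center_divergence(z, labels):
    """Proxy for divergence: largest inner product between two distinct
    class centers (lower means better separated). Assumes >= 2 classes."""
    z = l2_normalize(z)
    centers = np.stack([z[labels == c].mean(axis=0) for c in np.unique(labels)])
    gram = centers @ centers.T
    return gram[~np.eye(len(centers), dtype=bool)].max()

# Toy usage with random embeddings: view 2 is a small perturbation of view 1.
rng = np.random.default_rng(0)
z_view1 = rng.normal(size=(8, 16))
z_view2 = z_view1 + 0.05 * rng.normal(size=(8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(info_nce(z_view1, z_view2), alignment(z_view1, z_view2))
print(center_divergence(z_view1, labels))
```

The third factor, concentration of augmented data, depends on the augmentation itself rather than on the learned embeddings, so it is deliberately left out of this sketch.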