Contrastive representation learning is a modern paradigm for learning representations of unlabeled data via augmentations: contrastive models learn to embed semantically similar pairs of samples (positive pairs) closer together than independently drawn samples (negative samples). Despite its empirical success and widespread use in foundation models, the statistical theory of contrastive learning remains underexplored. Recent works have developed generalization error bounds for contrastive losses, but the resulting risk certificates are either vacuous (e.g., certificates based on Rademacher complexity or $f$-divergence) or rely on strong assumptions about the samples that are unrealistic in practice. The present paper develops non-vacuous PAC-Bayesian risk certificates for contrastive representation learning, tailored to the practical setting of the popular SimCLR framework. Notably, we take into account that SimCLR reuses each positive pair of augmented data as negative samples for the other data points, which induces strong dependence among the samples and renders classical PAC and PAC-Bayesian bounds inapplicable. We further refine existing bounds on the downstream classification loss by incorporating SimCLR-specific factors, including data augmentation and temperature scaling, and derive risk certificates for the contrastive zero-one risk. The resulting bounds for the contrastive loss and downstream prediction are substantially tighter than previous risk certificates, as demonstrated by experiments on CIFAR-10.
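To make the dependence structure concrete, below is a minimal sketch of the standard SimCLR (NT-Xent) contrastive loss, assuming PyTorch; the function name and signature are illustrative, not the paper's implementation. The diagonal masking and the shared similarity matrix show how each augmented positive pair simultaneously serves as negatives for every other sample in the batch, which is exactly the reuse that breaks the independence assumptions of classical bounds.

```python
# Minimal sketch of the SimCLR NT-Xent loss (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, d) embeddings of the two augmented views of N samples."""
    n = z1.shape[0]
    # Stack both views and L2-normalize so dot products are cosine similarities.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, d)
    sim = z @ z.t() / temperature                               # (2N, 2N) similarity logits
    # Mask self-similarity: each embedding is contrasted against the other
    # 2N - 1 embeddings, i.e., its one positive and 2N - 2 negatives drawn
    # from the *same* batch -- the source of the statistical dependence.
    sim.fill_diagonal_(float('-inf'))
    # View i (row i) is positive with view i + N (mod 2N).
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, pos)
```

Because the same 2N embeddings populate both the positive targets and the negative pool, the per-sample loss terms are not independent, motivating the dependence-aware PAC-Bayesian analysis developed in the paper.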