Good data augmentation is one of the key factors behind the empirical success of self-supervised representation learning methods such as contrastive learning and masked language modeling, yet theoretical understanding of its role in learning good representations remains limited. Recent work has established a connection between self-supervised learning and approximating the top eigenspace of a graph Laplacian operator. Learning a linear probe on top of such features can naturally be connected to RKHS regression. In this work, we use this insight to perform a statistical analysis of augmentation-based pretraining. We start from the isometry property, a key geometric characterization of the target function given by the augmentation. Our first main theorem provides, for an arbitrary encoder, nearly tight bounds for both the estimation error incurred by fitting the linear probe on top of the encoder and the approximation error entailed by how well the RKHS learned by the encoder fits the target function. Our second main theorem specifically addresses the case where the encoder extracts the top-d eigenspace of a Monte Carlo approximation of the underlying kernel built from finitely many pretraining samples. Our analysis completely disentangles the effects of the model and the augmentation. A key ingredient in our analysis is the augmentation complexity, which we use to quantitatively compare different augmentations and analyze their impact on downstream performance on synthetic and real datasets.
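The pipeline the abstract describes (a Monte Carlo approximation of an augmentation-induced kernel, an encoder given by its top-d eigenspace, and a linear probe on top) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-dimensional toy data, the additive Gaussian noise augmentation, the binning of augmented views, the kernel form K(x, x') = sum_a p(a|x) p(a|x') / p(a), and the ridge-regularized probe. The sketch is also transductive, meaning features are only defined on the pretraining samples themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, n_bins, d = 200, 50, 40, 4   # samples, views per sample, bins, feature dim

# Toy pretraining samples and a hypothetical augmentation (additive noise).
x = rng.uniform(-1.0, 1.0, size=n)
views = x[:, None] + 0.3 * rng.standard_normal((n, k))

# Monte Carlo estimate of p(a | x_i): histogram of each sample's views.
bins = np.clip(((views + 2.0) / 4.0 * n_bins).astype(int), 0, n_bins - 1)
P = np.zeros((n, n_bins))
np.add.at(P, (np.repeat(np.arange(n), k), bins.ravel()), 1.0)
P /= k                                   # each row now sums to 1

# Assumed kernel: K(x, x') = sum_a p(a|x) p(a|x') / p(a), scaled by 1/n.
p_a = P.mean(axis=0)                     # empirical marginal over bins
keep = p_a > 0                           # drop empty bins to avoid 0/0
M = P[:, keep] / np.sqrt(p_a[keep])
K = (M @ M.T) / n                        # symmetric PSD, rows sum to 1

# Encoder: top-d eigenvectors of the estimated kernel matrix.
eigvals, eigvecs = np.linalg.eigh(K)     # eigenvalues in ascending order
top_vals = eigvals[::-1][:d]
Phi = np.sqrt(n) * eigvecs[:, ::-1][:, :d]   # features on the n samples

# Downstream task: ridge-regularized linear probe on a small labeled subset.
y = np.sin(np.pi * x)                    # toy target, smooth in x
m, ridge = 50, 1e-6
Phi_l, y_l = Phi[:m], y[:m]
w = np.linalg.solve(Phi_l.T @ Phi_l + ridge * np.eye(d), Phi_l.T @ y_l)

train_mse = np.mean((Phi_l @ w - y_l) ** 2)
all_mse = np.mean((Phi @ w - y) ** 2)
print("top eigenvalues:", top_vals)
print("probe MSE (labeled subset):", train_mse, " (all samples):", all_mse)
```

In this toy setup the scaled kernel matrix has non-negative entries and rows that sum to one, so its largest eigenvalue is 1 with a constant eigenvector; the remaining top eigenvectors are the directions that vary most slowly under the augmentation. Changing the noise scale or the number of views per sample gives a rough feel for how the augmentation and the Monte Carlo approximation affect the probe, though this is only a sketch and not the analysis carried out in the paper.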