Data augmentation is critical to the empirical success of modern self-supervised representation learning, such as contrastive learning and masked language modeling. However, a theoretical understanding of the exact role of augmentation remains limited. Recent work has established a connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator, suggesting that learning a linear probe atop such a representation can be connected to RKHS regression. Building on this insight, this work presents a statistical analysis of augmentation-based pretraining. Starting from the isometry property, a geometric characterization of the target function given by the augmentation, we disentangle the effects of the model and the augmentation, and prove two generalization bounds that are free of model complexity. Our first bound holds for an arbitrary encoder, and it is the sum of an estimation error bound incurred by fitting a linear probe and an approximation error bound incurred by RKHS approximation. Our second bound specifically addresses the case where the encoder extracts the top-d eigenspace of a finite-sample-based approximation of the underlying RKHS. A key ingredient in our analysis is the augmentation complexity, which we use to quantitatively compare different augmentations and analyze their impact on downstream performance.
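As an illustration of the setting in the second bound, the following is a minimal sketch, not the paper's construction. It assumes a toy discrete setup in which `P[i, a]` holds p(a | x_i), and it assumes the augmentation-induced kernel takes the form K(x, x') = sum_a p(a|x) p(a|x') / p(a), which is one common choice. The function names `augmentation_kernel`, `top_d_eigenspace`, and `fit_linear_probe` are hypothetical. The encoder is taken to be the top-d eigenvectors of the finite-sample kernel matrix, and the linear probe is a ridge regression on those features.

```python
import numpy as np


def augmentation_kernel(P):
    """Kernel matrix from augmentation probabilities (assumed form).

    P[i, a] = p(a | x_i) for n inputs and m augmented views; rows sum to 1
    and every view must have positive marginal probability.
    """
    p_a = P.mean(axis=0)        # marginal p(a), with uniform weight on inputs
    return (P / p_a) @ P.T      # K[i, j] = sum_a p(a|x_i) p(a|x_j) / p(a)


def top_d_eigenspace(K, d):
    """Top-d eigenvectors of a symmetric matrix K, as an (n, d) feature map."""
    _, eigvecs = np.linalg.eigh(K)      # eigenvalues come back in ascending order
    return eigvecs[:, ::-1][:, :d]      # reverse columns, keep the d largest


def fit_linear_probe(Phi, y, lam=1e-8):
    """Ridge-regression linear probe on fixed features Phi."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)


# Toy data (hypothetical): inputs 0-2 share views {0, 1}, inputs 3-5 share views {2, 3}.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])  # downstream labels aligned with the views

Phi = top_d_eigenspace(augmentation_kernel(P), d=2)  # "encoder" output
w = fit_linear_probe(Phi, y)
pred = Phi @ w
```

In this toy case the labels are constant on each group of inputs that share augmented views, so they lie in the top-2 eigenspace and the probe recovers them up to the small ridge shrinkage. A target that varied within a group would not be captured by these features, which is the kind of mismatch the approximation error term is meant to account for.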