Data augmentation is critical to the empirical success of modern self-supervised representation learning methods such as contrastive learning and masked language modeling. However, theoretical understanding of the exact role of augmentation remains limited. Recent work has connected self-supervised learning to the approximation of the top eigenspace of a graph Laplacian operator, suggesting that learning a linear probe atop such a representation can be connected to RKHS regression. Building on this insight, this work presents a statistical analysis of augmentation-based pretraining. Starting from the isometry property, a geometric characterization of the target function given by the augmentation, we disentangle the effects of the model and the augmentation, and prove two generalization bounds that are free of model complexity. Our first bound holds for an arbitrary encoder and decomposes the prediction error into the sum of an estimation error incurred by fitting a linear probe with RKHS regression and an approximation error entailed by RKHS approximation. Our second bound specifically addresses the case where the encoder is near-optimal, that is, it approximates the top-d eigenspace of the RKHS induced by the augmentation. A key ingredient in our analysis is the augmentation complexity, which we use to quantitatively compare different augmentations and analyze their impact on downstream performance.