Contrastive learning -- a modern approach to extracting useful representations from unlabeled data by training models to distinguish similar samples from dissimilar ones -- has driven significant progress in foundation models. In this work, we develop a new theoretical framework for analyzing data-augmentation-based contrastive learning, with SimCLR as a representative example. Our approach builds on the concept of \emph{approximate sufficient statistics}, originally defined via the KL divergence in \cite{oko2025statistical} for contrastive language-image pretraining (CLIP); we generalize it to equivalent forms and to general $f$-divergences, and show that minimizing the SimCLR loss and other contrastive losses yields encoders that are approximately sufficient. Furthermore, we demonstrate that these near-sufficient encoders can be effectively adapted to downstream regression and classification tasks, with performance depending on their sufficiency and on the error induced by data augmentation during contrastive learning. Concrete examples in linear regression and topic classification illustrate the broad applicability of our results.
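For concreteness, one standard form of the SimCLR (InfoNCE) objective referenced above is sketched below; the notation ($z_i$ for encoder outputs, $\tau$ for the temperature, $\mathrm{sim}$ for cosine similarity) is illustrative and not necessarily the paper's own:
\[
\ell_{i,j} \;=\; -\log \frac{\exp\bigl(\mathrm{sim}(z_i, z_j)/\tau\bigr)}{\sum_{k=1}^{2N} \mathbf{1}\{k \neq i\}\,\exp\bigl(\mathrm{sim}(z_i, z_k)/\tau\bigr)},
\qquad
\mathrm{sim}(u,v) = \frac{u^\top v}{\|u\|\,\|v\|},
\]
where $(z_i, z_j)$ are the representations of two augmented views of the same sample within a batch of $2N$ augmented views, and the loss is averaged over all positive pairs.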