Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.