On the Provable Advantage of Unsupervised Pretraining

Unsupervised pretraining, which learns a useful representation from a large amount of unlabeled data to facilitate the learning of downstream tasks, is a critical component of modern large-scale machine learning systems. Despite its tremendous empirical success, rigorous theoretical understanding of why unsupervised pretraining generally helps remains limited: most existing results are restricted to particular methods or approaches for unsupervised pretraining with specialized structural assumptions. This paper studies a generic framework, where the unsupervised representation learning task is specified by an abstract class of latent variable models $\Phi$ and the downstream task is specified by a class of prediction functions $\Psi$. We consider a natural approach of using Maximum Likelihood Estimation (MLE) for unsupervised pretraining and Empirical Risk Minimization (ERM) for learning downstream tasks. We prove that, under a mild "informative" condition, our algorithm achieves an excess risk of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n})$ for downstream tasks, where $\mathcal{C}_\Phi, \mathcal{C}_\Psi$ are complexity measures of the function classes $\Phi, \Psi$, and $m, n$ are the numbers of unlabeled and labeled samples, respectively. Compared to the baseline of $\tilde{\mathcal{O}}(\sqrt{\mathcal{C}_{\Phi \circ \Psi}/n})$ achieved by performing supervised learning using only the labeled data, our result rigorously shows the benefit of unsupervised pretraining when $m \gg n$ and $\mathcal{C}_{\Phi\circ \Psi} > \mathcal{C}_\Psi$. This paper further shows that our generic framework covers a wide range of approaches for unsupervised pretraining, including factor models, Gaussian mixture models, and contrastive learning.
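To make the comparison of the two rates concrete, the following is a minimal numerical sketch, not part of the paper's analysis. It evaluates $\sqrt{\mathcal{C}_\Phi/m} + \sqrt{\mathcal{C}_\Psi/n}$ and $\sqrt{\mathcal{C}_{\Phi \circ \Psi}/n}$ with all constants and logarithmic factors dropped. The function names and every numerical value are hypothetical, and setting $\mathcal{C}_{\Phi \circ \Psi} = \mathcal{C}_\Phi + \mathcal{C}_\Psi$ is an assumption made only for illustration.

```python
import math


def pretrain_rate(c_phi: float, c_psi: float, m: int, n: int) -> float:
    """sqrt(C_Phi / m) + sqrt(C_Psi / n), ignoring constants and log factors."""
    return math.sqrt(c_phi / m) + math.sqrt(c_psi / n)


def supervised_rate(c_comp: float, n: int) -> float:
    """sqrt(C_{Phi o Psi} / n), ignoring constants and log factors."""
    return math.sqrt(c_comp / n)


# Hypothetical complexities: a rich representation class, a simple predictor class.
c_phi, c_psi = 1e4, 1e1
# Illustrative assumption only: the composite class is as complex as both parts combined.
c_comp = c_phi + c_psi
# Regime of interest: far more unlabeled samples (m) than labeled samples (n).
m, n = 10**8, 10**3

with_pretraining = pretrain_rate(c_phi, c_psi, m, n)
labeled_only = supervised_rate(c_comp, n)

print(f"MLE pretraining + ERM rate: {with_pretraining:.4f}")
print(f"supervised-only rate:       {labeled_only:.4f}")
```

Under these assumed values the unlabeled-data term $\sqrt{\mathcal{C}_\Phi/m}$ is small, so the pretraining rate is dominated by $\sqrt{\mathcal{C}_\Psi/n}$, which is the regime $m \gg n$, $\mathcal{C}_{\Phi\circ \Psi} > \mathcal{C}_\Psi$ described above.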
