This paper presents a new statistical analysis that aims to explain the recent success of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., the different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted $\tilde{\nu}$) is large, pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show that the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate of standard supervised learning. Here, $n$ is the number of pre-training samples and $m$ is the number of samples in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and on a modified self-concordance condition. These techniques may be of independent interest.
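As a rough numerical illustration of the two rates, the following minimal sketch computes the least singular value of a stand-in last-layer weight matrix and compares $\frac{1}{\tilde{\nu}\sqrt{n}}$ with $\frac{1}{\sqrt{m}}$. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the random matrix `W`, its shape and scaling, and the values of `n` and `m` are hypothetical, and constants and logarithmic factors hidden by the $O(\cdot)$ notation are ignored.

```python
import numpy as np


def least_singular_value(weight: np.ndarray) -> float:
    """Return the smallest singular value of a 2-D weight matrix."""
    # compute_uv=False returns only the singular values, in descending order.
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return float(singular_values[-1])


def transfer_rate(nu: float, n: int) -> float:
    """Order of the transfer-learning rate, 1 / (nu * sqrt(n)), constants dropped."""
    return float(1.0 / (nu * np.sqrt(n)))


def supervised_rate(m: int) -> float:
    """Order of the supervised-learning rate, 1 / sqrt(m), constants dropped."""
    return float(1.0 / np.sqrt(m))


# Hypothetical last linear layer: 1000 pre-training classes, 64 hidden units.
# A random matrix is used only as a stand-in for trained weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 64)) / np.sqrt(1000)

nu = least_singular_value(W)
n, m = 1_000_000, 1_000  # assumed sample sizes with n >> m

print(f"least singular value: {nu:.3f}")
print(f"transfer rate order:   {transfer_rate(nu, n):.5f}")
print(f"supervised rate order: {supervised_rate(m):.5f}")
```

Under these assumptions, the transfer rate is the smaller of the two whenever $\tilde{\nu} > \sqrt{m/n}$, which is why a large least singular value together with $n \gg m$ favors pre-training.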