For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. On a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional ``supervised learning'' (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To address these questions, we first show theoretically that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Next, because the pretraining dataset is large and diverse and therefore covers most features in the downstream dataset, the pretrained encoder can, in the fine-tuning phase, capture as many features of the downstream dataset as possible, with a theoretical guarantee that it does not lose them. In contrast, SL captures only a random subset of features, owing to the lottery ticket hypothesis. Consequently, MRP provably achieves better performance than SL on classification tasks. Experimental results support both our data assumptions and our theoretical implications.
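As a rough illustration of the masking-and-reconstruction step described in the abstract, the following minimal sketch masks a random subset of patches and scores a reconstruction only on the masked ones. It is not the authors' implementation: the function names `random_patch_mask` and `masked_reconstruction_loss`, the 0.75 mask ratio, the toy patch shape, and the zero-filling of masked patches are all assumptions made for illustration, and the encoder/decoder itself is omitted.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * num_patches) True (masked) entries."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    # Sample distinct patch indices to hide from the encoder.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask


def masked_reconstruction_loss(patches, reconstruction, mask):
    """Mean squared error computed over the masked patches only."""
    diff = reconstruction[mask] - patches[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))  # toy input: 16 patches, 8 values each
mask = random_patch_mask(num_patches=16, mask_ratio=0.75, rng=rng)

# Assumed simplification: masked patches are zero-filled in the encoder input.
visible_input = np.where(mask[:, None], 0.0, patches)

# A perfect reconstruction gives zero loss; an auto-encoder would be trained
# to drive this quantity down from the visible input alone.
perfect_loss = masked_reconstruction_loss(patches, patches, mask)
zero_fill_loss = masked_reconstruction_loss(patches, visible_input, mask)
```

Here `zero_fill_loss` is the loss obtained by passing the masked input through unchanged, a trivial reference point that any useful auto-encoder should beat.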