Masked pre-training removes random input dimensions and learns a model that can predict the missing values. Empirical results indicate that this intuitive form of self-supervised learning yields models that generalize very well to new domains. A theoretical understanding is, however, lacking. This paper shows that masked pre-training with a suitable cumulative scoring function corresponds to maximizing the model's marginal likelihood, which is the de facto Bayesian measure of generalization for model selection. Beyond shedding light on the success of masked pre-training, this insight also suggests that Bayesian models can be trained with appropriately designed self-supervision. Empirically, we confirm the developed theory and explore the main learning principles of masked pre-training in large language models.
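
To make the link between cumulative masked prediction and the marginal likelihood concrete, the following is a minimal sketch rather than the paper's method: the abstract does not specify the scoring function, so the sketch assumes the standard chain-rule decomposition, in which dimensions are unmasked one at a time and the log-probabilities of each newly revealed value, given those already revealed, are summed. A fixed joint Gaussian stands in for the model's marginal likelihood, and all function names (`gaussian_logpdf`, `conditional_logpdf`, `cumulative_masked_score`) are hypothetical.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian N(mean, cov) evaluated at x."""
    x, mean, cov = np.atleast_1d(x), np.atleast_1d(mean), np.atleast_2d(cov)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + quad)

def conditional_logpdf(x, mean, cov, target, observed):
    """log p(x[target] | x[observed]) under the joint Gaussian N(mean, cov)."""
    if len(observed) == 0:
        # Everything else is masked: fall back to the marginal of x[target].
        return gaussian_logpdf(x[target], mean[target], cov[target, target])
    s_oo = cov[np.ix_(observed, observed)]   # covariance of observed dims
    s_to = cov[target, observed]             # cross-covariance (1-D)
    w = np.linalg.solve(s_oo, s_to)
    cond_mean = mean[target] + w @ (x[observed] - mean[observed])
    cond_var = cov[target, target] - w @ s_to
    return gaussian_logpdf(x[target], cond_mean, cond_var)

def cumulative_masked_score(x, mean, cov, order):
    """Sum of masked-prediction log-scores while unmasking dims in `order`."""
    order = [int(i) for i in order]
    return sum(
        conditional_logpdf(x, mean, cov, order[t], order[:t])
        for t in range(len(order))
    )

# Illustrative setup (assumed values, not taken from the paper).
rng = np.random.default_rng(0)
D = 5
A = rng.normal(size=(D, D))
cov = A @ A.T + D * np.eye(D)   # symmetric positive definite
mean = rng.normal(size=D)
x = rng.normal(size=D)

joint = gaussian_logpdf(x, mean, cov)
scores = [cumulative_masked_score(x, mean, cov, rng.permutation(D))
          for _ in range(20)]

# Every unmasking order recovers the same joint log-probability.
assert np.allclose(scores, joint)
```

Because every ordering yields the same total, averaging the cumulative score over random maskings leaves the value unchanged. Under these assumptions, that is the sense in which a cumulative masked objective can coincide with the log marginal likelihood; a score computed at a single fixed masking rate would not in general satisfy this identity.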