Self-supervised pre-training of language models usually consists in predicting probability distributions over large token vocabularies. In this study, we propose a method that moves away from probability prediction and instead reconstructs input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pre-train Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages: it reduces the compute required for training by up to 20 times while also improving downstream performance and data efficiency. We observe a +1.6 GLUE score increase and a +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
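To make the idea of reconstructing input embeddings "in a contrastive fashion" more concrete, the following is a minimal sketch of one way such an objective could look. The abstract does not specify the loss, so several choices here are assumptions: dot-product similarity, negatives drawn from the other targets in the same batch, and an InfoNCE-style log-softmax. The function name `contrastive_weight_tying_loss` and its arguments are hypothetical, and plain Python lists stand in for tensors to keep the example dependency-free.

```python
import math


def contrastive_weight_tying_loss(outputs, embeddings, target_ids):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's exact form).

    outputs:    list of model output vectors, one per position
    embeddings: input embedding table, indexed by token id
    target_ids: token id that each output vector should reconstruct
    """
    # Targets are looked up in the *input* embedding table, so no separate
    # output projection over the vocabulary is needed (the "headless" part).
    targets = [embeddings[t] for t in target_ids]

    total = 0.0
    for i, out in enumerate(outputs):
        # Similarity of this output to every target embedding in the batch;
        # the other batch targets serve as negatives (assumed here).
        scores = [sum(o * e for o, e in zip(out, tgt)) for tgt in targets]

        # Numerically stable log-sum-exp over the batch scores.
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))

        # Negative log-probability of picking the correct target.
        total += log_denom - scores[i]

    return total / len(outputs)


# Toy data: three token embeddings and two sets of candidate outputs.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
target_ids = [0, 1, 2]
aligned = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]      # point at their own targets
misaligned = [[0.0, 5.0], [-5.0, 0.0], [5.0, 0.0]]   # point at the wrong targets

loss_aligned = contrastive_weight_tying_loss(aligned, embeddings, target_ids)
loss_misaligned = contrastive_weight_tying_loss(misaligned, embeddings, target_ids)
```

In this sketch the cost of the loss grows with the batch size rather than with the vocabulary size, which is one plausible reading of where the compute savings mentioned above would come from. Note that the sketch treats repeated tokens within a batch as negatives of one another, a simplification a real implementation would likely need to handle.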