Pre-trained language models (PLMs) have established a new paradigm in the field of NLP. To obtain more powerful PLMs, one of the most popular and successful approaches is to continuously scale up both the models and the pre-training corpora. These large corpora are generally obtained by merging smaller ones from multiple sources, and they are thus growing increasingly diverse. However, the side effects of these colossal merged corpora remain understudied. In this paper, we identify the disadvantage of using heterogeneous corpora from multiple sources to pre-train PLMs. Towards coordinated pre-training on diverse corpora, we further propose source prompts (SP), which explicitly inform the model of the data source at both the pre-training and fine-tuning stages. Results of extensive experiments demonstrate that PLMs pre-trained with SP on diverse corpora achieve significant improvements on various downstream tasks.
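
To illustrate the idea of source prompts, the following is a minimal sketch, assuming the prompt is a plain-text tag prepended to each example. The abstract does not specify the prompt format, so the template, the function names `add_source_prompt` and `build_corpus`, and the sample sources are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical prompt template; the abstract does not give the actual format.
SOURCE_PROMPT_TEMPLATE = "[source: {source}] "


def add_source_prompt(text: str, source: str) -> str:
    """Prepend a textual tag naming the data source to one example."""
    return SOURCE_PROMPT_TEMPLATE.format(source=source) + text


def build_corpus(examples: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield source-prompted texts from (source, text) pairs."""
    for source, text in examples:
        yield add_source_prompt(text, source)


# Illustrative (made-up) sources and sentences from a merged corpus.
merged = [
    ("wikipedia", "Paris is the capital of France."),
    ("news", "Markets closed higher on Friday."),
]
prompted = list(build_corpus(merged))
```

Under the same assumption, the fine-tuning stage would reuse `add_source_prompt` on each downstream example, so that the model sees a source tag in both stages; which tag to use for a downstream dataset is a design choice this sketch leaves open.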