This thesis focuses on improving the pre-training of natural language models on raw, unlabeled data, making them more efficient and better aligned with downstream applications. In the first part, we introduce three alternatives to BERT's Masked Language Modeling (MLM) objective: Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). These objectives replace input tokens with other tokens instead of masking them; RTS and C-RTS train the model to predict whether each token is original, whereas SLM trains it to predict the original token values. Results show that RTS and C-RTS require less pre-training time while maintaining performance comparable to MLM. Surprisingly, SLM outperforms MLM on certain tasks under the same computational budget. In the second part, we propose self-supervised pre-training tasks that align structurally with downstream applications, reducing the need for labeled data. Using large corpora such as Wikipedia and CC-News, we train models to recognize, in several ways, whether text spans originate from the same paragraph or document. By continuing pre-training from existing models such as RoBERTa, ELECTRA, DeBERTa, BART, and T5, we demonstrate significant performance improvements on tasks such as Fact Verification, Answer Sentence Selection, and Summarization. These improvements are especially pronounced when annotated data is limited. The proposed objectives also achieve state-of-the-art results on several benchmark datasets, including FEVER (dev set), ASNQ, WikiQA, and TREC-QA, and improve the quality of generated summaries. Importantly, these techniques can be easily combined with other methods without altering the internal structure of Transformer models, making them versatile for a wide range of NLP applications.
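The token-substitution objectives described in the abstract can be illustrated with a minimal sketch of how training targets might be built. It is not the thesis implementation: the function name `corrupt_tokens`, the 15% substitution rate (borrowed from BERT's masking rate), the `IGNORE` value of -100, and the choice to give SLM a target only at substituted positions are all assumptions made for illustration. C-RTS is omitted because the abstract does not describe how its clusters are built.

```python
import random

# Assumed sentinel for positions that carry no SLM target (hypothetical choice).
IGNORE = -100


def corrupt_tokens(token_ids, vocab_size, swap_prob=0.15, seed=None):
    """Replace a random subset of tokens with different random vocabulary ids.

    Returns (corrupted, rts_labels, slm_labels):
      - rts_labels[i] is 1 if position i was substituted, else 0
        (a binary "original or not" target, in the spirit of RTS).
      - slm_labels[i] is the original id at substituted positions and
        IGNORE elsewhere (a token-recovery target, in the spirit of SLM).
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2 to draw a different token")
    rng = random.Random(seed)
    corrupted, rts_labels, slm_labels = [], [], []
    for tok in token_ids:
        if rng.random() < swap_prob:
            # Draw uniformly from the vocabulary minus the original token:
            # sample from vocab_size - 1 ids and shift past the original.
            new_tok = rng.randrange(vocab_size - 1)
            if new_tok >= tok:
                new_tok += 1
            corrupted.append(new_tok)
            rts_labels.append(1)
            slm_labels.append(tok)
        else:
            corrupted.append(tok)
            rts_labels.append(0)
            slm_labels.append(IGNORE)
    return corrupted, rts_labels, slm_labels


if __name__ == "__main__":
    original = [12, 7, 3, 99, 42, 5, 18, 64]
    corrupted, rts, slm = corrupt_tokens(original, vocab_size=100, seed=0)
    print("corrupted :", corrupted)
    print("RTS labels:", rts)
    print("SLM labels:", slm)
```

In this sketch both objectives share the same corrupted input and differ only in the target. The RTS-style target is a per-token binary label, so it needs only a two-way classification head. The SLM-style target is the original token id, so it needs a prediction over the full vocabulary, as MLM does.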