Language models (LMs) trained on vast quantities of unlabelled data have greatly advanced the field of natural language processing (NLP). In this study, we revisit the widely accepted notion in NLP that continued pre-training of LMs on task-related texts improves the performance of fine-tuning (FT) on downstream tasks. Through experiments on eight single-sentence tasks and eight sentence-pair tasks in both semi-supervised and fully-supervised settings, we find that conventional continued pre-training does not consistently provide benefits and can even be detrimental for sentence-pair tasks or when prompt-based FT is used. To tackle these issues, we propose Prompt-based Continued Pre-training (PCP), which combines the idea of instruction tuning with conventional continued pre-training. Our approach aims to improve the performance of prompt-based FT by presenting both task-related texts and prompt templates to LMs through unsupervised pre-training objectives before fine-tuning for the target task. Our empirical evaluations on 21 benchmarks demonstrate that PCP consistently improves the performance of state-of-the-art prompt-based FT approaches (by up to 20.1% absolute) in both semi-supervised and fully-supervised settings, even with only hundreds of unlabelled examples. Additionally, prompt-based FT with PCP outperforms state-of-the-art semi-supervised approaches while being simpler, as it requires neither an iterative process nor extra data augmentation. Our further analysis explores the performance lower bound of PCP and reveals that its advantages persist across model and dataset sizes.
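As an illustration of presenting task-related texts together with a prompt template under an unsupervised objective, the following is a minimal sketch and not the authors' implementation. It assumes a masked-language-modelling objective; the template string, the whitespace tokenisation, the 15% masking rate, and all function names are hypothetical choices made for this example only.

```python
import random

MASK = "[MASK]"  # assumed mask token; real tokenizers define their own


def apply_template(text, template="{text} It was " + MASK + " ."):
    """Wrap an unlabelled text in a (hypothetical) prompt template."""
    return template.format(text=text)


def mask_tokens(tokens, rate=0.15, seed=0):
    """Randomly replace tokens with MASK and record the originals as targets.

    Positions that are not masked get a target of None. The template's own
    MASK slot is left untouched because no label is available for it.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if tok != MASK and rng.random() < rate:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


def build_example(text, rate=0.15, seed=0):
    """Templated text -> (masked inputs, reconstruction targets)."""
    tokens = apply_template(text).split()  # whitespace split is a simplification
    return mask_tokens(tokens, rate, seed)


if __name__ == "__main__":
    inputs, targets = build_example("a gripping and well acted film", rate=0.3, seed=1)
    print(inputs)
    print(targets)
```

In this sketch the model would be trained to reconstruct the masked tokens of the templated text, so it sees the template wording during continued pre-training and before any fine-tuning on labelled data.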