Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
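As a rough illustration of what casting different problems into a single text-to-text format can look like, the sketch below maps a classification, a summarization, and a question answering instance to plain (source, target) string pairs. It is a minimal sketch under stated assumptions, not the released code: the names `TextToTextExample` and `to_text_to_text`, the task prefixes, and the sample sentences are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One training pair in which both input and target are plain strings."""
    source: str
    target: str


def to_text_to_text(task: str, fields: dict, target: object) -> TextToTextExample:
    """Cast one task instance into a (source, target) string pair.

    Assumption for illustration: the task name is prepended as a prefix and
    each input field is rendered as "name: value". Neither choice is taken
    from the abstract.
    """
    # Dicts preserve insertion order, so the fields appear as they were given.
    body = " ".join(f"{name}: {value}" for name, value in fields.items())
    # The target is converted to a string so that class labels and numeric
    # scores are all handled as text generation.
    return TextToTextExample(source=f"{task} {body}".strip(), target=str(target))


# Hypothetical instances of three different problem types.
examples = [
    to_text_to_text("sentiment", {"sentence": "A great film."}, "positive"),
    to_text_to_text("summarize", {"article": "The storm closed roads on Monday."},
                    "Storm closes roads."),
    to_text_to_text("qa", {"question": "Who wrote it?", "context": "Ann wrote it."},
                    "Ann"),
]

for ex in examples:
    print(repr(ex.source), "->", repr(ex.target))
```

Because every instance reduces to the same pair of strings, a single model, loss, and decoding procedure could in principle be reused across all of these tasks, which is the property the unified framework relies on.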