We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension (practice after reading improves the ability to answer questions based on the learned knowledge), we propose a simple method for transforming raw corpora into reading comprehension texts: each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpus, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves performance competitive with much larger domain-specific models, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at https://github.com/microsoft/LMOps.
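To make the transformation concrete, the following is a minimal sketch of how a raw text might be enriched with tasks related to its content. It is an illustration only, not the released implementation: the function names (`split_sentences`, `to_reading_comprehension`), the two task types (a title question and a continuation question), and the question wording are all assumptions introduced here, and the abstract does not specify which tasks the actual method uses.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def to_reading_comprehension(title: str, raw_text: str) -> str:
    """Append hypothetical comprehension tasks to a raw text.

    Both task types below are illustrative assumptions, not the
    task set used by the authors.
    """
    sentences = split_sentences(raw_text)
    tasks = []

    # Hypothetical task 1: use the document title as a short summary target.
    tasks.append(
        "Question: What is a suitable title for the passage above?\n"
        f"Answer: {title}"
    )

    # Hypothetical task 2: text completion, using the last sentence as the answer.
    if len(sentences) >= 2:
        prefix = " ".join(sentences[:-1])
        tasks.append(
            f'Question: How does the passage continue after: "{prefix}"\n'
            f"Answer: {sentences[-1]}"
        )

    # The raw text comes first, followed by the tasks derived from it.
    return raw_text.strip() + "\n\n" + "\n\n".join(tasks)


if __name__ == "__main__":
    example = to_reading_comprehension(
        "Aspirin and platelets",
        "Aspirin inhibits cyclooxygenase. This reduces platelet aggregation.",
    )
    print(example)
```

Because the tasks in this sketch are derived from the text itself, without manual annotation, the same procedure can be run over an entire corpus, which is the sense in which such a transformation scales.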