Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. We observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) continually pre-training the LLM on an appropriately sized subset for multiple epochs, which yields faster performance recovery than pre-training on the full corpus for a single epoch; (2) pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget, and enhance average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at \url{https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct}.
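The three strategies above can be illustrated with a minimal corpus-assembly sketch. This is not the paper's actual pipeline: the function name, the `quality_scores` input, and all fraction/epoch defaults are hypothetical placeholders chosen for illustration.

```python
import random

def build_continual_corpus(domain_docs, general_docs, quality_scores,
                           subset_frac=0.25, replay_frac=0.2, epochs=4,
                           seed=0):
    """Assemble a training stream following the three strategies:
    (1) repeat a fixed-size subset for several epochs rather than one
        pass over the full corpus,
    (2) keep only the highest-quality domain documents (quality_scores
        is an assumed external quality signal),
    (3) mix in general-domain data to narrow the distribution gap with
        the original pre-training mixture."""
    rng = random.Random(seed)
    # Strategy (2): rank domain docs by quality, keep the top fraction.
    ranked = sorted(zip(domain_docs, quality_scores),
                    key=lambda pair: pair[1], reverse=True)
    k = max(1, int(len(ranked) * subset_frac))
    subset = [doc for doc, _ in ranked[:k]]
    # Strategy (3): add general-domain samples at a fixed mixing ratio
    # so that replay data makes up roughly replay_frac of each epoch.
    n_replay = int(len(subset) * replay_frac / (1 - replay_frac))
    replay = rng.sample(general_docs, min(n_replay, len(general_docs)))
    # Strategy (1): repeat the same shuffled mixture for multiple epochs.
    stream = []
    for _ in range(epochs):
        epoch = subset + replay
        rng.shuffle(epoch)
        stream.extend(epoch)
    return stream
```

With 20 domain documents, `subset_frac=0.25` keeps the 5 highest-scoring ones, adds 1 general-domain document per epoch, and repeats that 6-document mixture for 4 epochs, so each retained document is seen 4 times within the fixed budget.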