Large language models exhibit promising general capabilities but often lack the specialized knowledge needed for domain-specific tasks. Developing domain experts from a base model enables a range of applications without prohibitive training costs. This work demonstrates a method that combines continuous training and instruction fine-tuning to rapidly adapt Llama 2 base models to the Chinese medical domain. We first conduct continuous training on 1B tokens from Chinese medical references to teach relevant vocabulary and knowledge. The models are then fine-tuned on 54K examples sourced from the Chinese National Medical Licensing Examination. Experiments on Chinese medical data confirm the effectiveness of this approach, producing a model comparable to GPT-3.5-turbo while using far fewer computational resources. The resulting domain-specific model could be useful for various Chinese medical applications. More broadly, this work provides a template for domain-specific training of large language models in areas where pre-trained models lack the required expertise, such as law, science, and engineering.