ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

Recently, the increasing demand for superior medical services has highlighted the discrepancies in the medical infrastructure. With big data, especially texts, forming the foundation of medical services, there is an urgent need for effective natural language processing (NLP) solutions tailored to the healthcare domain. Conventional approaches leveraging pre-trained models present promising results in this domain, and current large language models (LLMs) offer an advanced foundation for medical text processing. However, most medical LLMs are trained only with supervised fine-tuning (SFT), which efficiently empowers LLMs to understand and respond to medical instructions but is ineffective in learning domain knowledge and aligning with human preferences. Another engineering barrier that prevents current medical LLMs from achieving better text processing ability is their restricted context length (e.g., 2,048 tokens), which makes it hard for them to process the long contexts frequently required in the medical domain. In this work, we propose ChiMed-GPT, a new benchmark LLM designed explicitly for the Chinese medical domain, which has an enlarged context length of 4,096 tokens and undergoes a comprehensive training regime of pre-training, SFT, and reinforcement learning from human feedback (RLHF). Evaluations on real-world tasks including information extraction, question answering, and dialogue generation demonstrate ChiMed-GPT's superior performance over general-domain LLMs. Furthermore, we analyze possible biases by prompting ChiMed-GPT to complete attitude scales regarding discrimination against patients, so as to contribute to the further responsible development of LLMs in the medical domain. The code and model are released at https://github.com/synlp/ChiMed-GPT.
