Recent advances in Large Language Models (LLMs) have brought remarkable breakthroughs in understanding and responding to user intent. However, their performance in some specialized domains, such as Chinese medicine, lags behind their performance in general use cases. Existing efforts to incorporate Chinese medicine into LLMs rely on Supervised Fine-Tuning (SFT) with single-turn, distilled dialogue data. The resulting models lack the ability for doctor-like proactive inquiry and multi-turn comprehension, and cannot always align their responses with experts' standards of safety and professionalism. In this work, we introduce Zhongjing, the first LLaMA-based Chinese medical LLM that implements an entire training pipeline from pre-training through reinforcement learning with human feedback (RLHF). Additionally, we introduce CMtMedQA, a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, which significantly enhances the model's capability for complex dialogue and proactive inquiry. We define refined annotation rules and evaluation criteria that account for the biomedical domain's unique characteristics. Results show that our model outperforms baselines in various capacities and matches the performance of ChatGPT in a few abilities, despite using only 1/50 of the training data of the previous best model and having 1/100 of the parameters of ChatGPT. RLHF further improves the model's instruction-following ability and safety. We also release our code, datasets and model for further research.