Recently, artificial intelligence (AI) systems based on large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, these models face a significant challenge in sensitive applications, such as reasoning over medical knowledge and answering medical questions in a physician-like manner. Prior studies attempted to overcome this challenge by increasing the model size (>100B parameters) to learn more general medical knowledge, yet there is still room for improvement in smaller-scale LLMs (<100B parameters). In this work, we start from a pre-trained general LLM (AntGLM-10B) and fine-tune it from a medical beginner towards a medical expert (called AntGLM-Med-10B) using a 3-stage optimization procedure: general medical knowledge injection, medical domain instruction tuning, and specific medical task adaptation. Our contributions are threefold: (1) We specifically investigate how to adapt a pre-trained general LLM to the medical domain, especially for a specific medical task. (2) We collect and construct large-scale medical datasets for each stage of the optimization process. These datasets encompass various data types and tasks, such as question answering, medical reasoning, multi-choice questions, and medical conversations. (3) Specifically for multi-choice questions in the medical domain, we propose a novel Verification-of-Choice approach for prompt engineering, which significantly enhances the reasoning ability of LLMs. Remarkably, by combining the above approaches, our AntGLM-Med-10B model outperforms most LLMs on PubMedQA, including both general and medical LLMs, even when these LLMs have larger model sizes.