Recently, large language models (LLMs) have showcased remarkable capabilities in natural language understanding. While proficient in everyday conversation and question answering, these models frequently struggle in domains that require precision, such as medical applications, because they lack domain-specific knowledge. In this paper, we describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA. Our contributions are threefold: (i) we systematically investigate the process of adapting a general-purpose foundation language model to the medical domain, which involves data-centric knowledge injection through the integration of 4.8M biomedical academic papers and 30K medical textbooks, as well as comprehensive fine-tuning for alignment with domain-specific instructions; (ii) we contribute a large-scale, comprehensive dataset for instruction tuning, which encompasses medical question answering (QA), rationales for reasoning, and conversational dialogues, comprising a total of 202M tokens; (iii) we conduct thorough ablation studies to demonstrate the effectiveness of each proposed component. When evaluated on various public medical question-answering benchmarks, our lightweight PMC-LLaMA, which consists of only 13 billion parameters, exhibits superior performance, even surpassing ChatGPT. All models, code, and datasets can be found at https://github.com/chaoyi-wu/PMC-LLaMA.