Recent large language models (LLMs) such as ChatGPT and LLaMA have shown great promise in many AI applications. However, their performance on medical tasks is suboptimal and can be improved by training on extensive domain-specific datasets. This study introduces Me LLaMA, a family of medical LLMs comprising the foundation models Me LLaMA 13B and 70B and their chat-enhanced versions, Me LLaMA 13B-chat and 70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 on large medical datasets. Our domain-specific data suite for training and evaluation includes a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) spanning six tasks and 12 datasets. Our extensive evaluation on MIBE shows that Me LLaMA models achieve better overall performance than existing open-source medical LLMs in zero-shot, few-shot, and supervised learning settings. Their zero-shot performance is comparable to that of ChatGPT on 7 of 8 datasets, with differences within 3%, but falls short of GPT-4. In addition, we investigated the catastrophic forgetting problem, and our results show that Me LLaMA models mitigate this issue better than other open-source medical LLMs. Me LLaMA is one of the largest open-source medical foundation LLMs trained on both biomedical and clinical data. It outperforms other open-source medical LLMs on both general and medical tasks, making it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.