Recent large language models (LLMs) such as ChatGPT and LLaMA have shown great promise in many AI applications. However, their performance on medical tasks is suboptimal and can be improved by training on extensive domain-specific datasets. This study introduces Me LLaMA, a medical LLM family comprising the foundation models Me LLaMA 13B and 70B and their chat-enhanced versions, Me LLaMA 13B-chat and 70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 on large medical datasets. Our domain-specific data suite for training and evaluation includes a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) spanning six tasks and 12 datasets. Our extensive evaluation on MIBE shows that Me LLaMA models achieve better overall performance than existing open-source medical LLMs in zero-shot, few-shot, and supervised learning settings. Their zero-shot performance is comparable to that of ChatGPT on 7 out of 8 datasets, with differences within 3%, but falls short of GPT-4. In addition, we investigated the catastrophic forgetting problem, and our results show that Me LLaMA models mitigate it better than other open-source medical LLMs. Me LLaMA is one of the largest open-source medical foundation LLMs that use both biomedical and clinical data. It outperforms other open-source medical LLMs on both general and medical tasks, making it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.