Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (≤ 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of NVIDIA's Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.