While recent advances in commercial large language models (LMs) have shown promising results on medical tasks, their closed-source nature poses significant privacy and security concerns, hindering their widespread use in the medical field. Despite efforts to create open-source models, their limited parameter counts often result in insufficient multi-step reasoning capabilities for solving complex medical problems. To address this, we introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters. The models were trained on our new synthetic dataset of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets. Our systems achieved remarkable accuracy across six medical benchmarks, surpassing the previous best models, such as MediTron and BioMistral, as well as GPT-3.5, by a large margin. Notably, Meerkat-7B surpassed the passing threshold of the United States Medical Licensing Examination (USMLE), a first for a 7B-parameter model, while Meerkat-70B outperformed GPT-4 by an average of 1.3%. Additionally, Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, exceeding the 13.8 achieved by humans and closely matching GPT-4's 21.8. Our systems also offered more detailed free-form responses to clinical queries than existing small models, approaching the performance level of large commercial models. These results significantly narrow the performance gap with large LMs, showcasing the effectiveness of our systems in addressing complex medical challenges.