Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks

While recent advancements in commercial large language models (LMs) have shown promising results in medical tasks, their closed-source nature poses significant privacy and security concerns, hindering their widespread use in the medical field. Despite efforts to create open-source models, their limited parameter counts often leave them without the multi-step reasoning capabilities required to solve complex medical problems. To address this, we introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters. The models were trained using our new synthetic dataset consisting of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets. Our systems achieved remarkable accuracy across six medical benchmarks, surpassing previous best models such as MediTron and BioMistral, as well as GPT-3.5, by a large margin. Notably, Meerkat-7B surpassed the passing threshold of the United States Medical Licensing Examination (USMLE) for the first time for a 7B-parameter model, while Meerkat-70B outperformed GPT-4 by an average of 1.3%. Additionally, Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming humans (13.8) and closely matching GPT-4 (21.8). Our systems offered more detailed free-form responses to clinical queries than existing small models, approaching the performance level of large commercial models. These results significantly narrow the performance gap with large LMs, showcasing the models' effectiveness in addressing complex medical challenges.
