The application of Large Language Models (LLMs) to clinical tasks has attracted growing research attention. However, real-world clinical decision-making differs significantly from the standardized, exam-style scenarios commonly used in current efforts. In this paper, we present RiskAgent, a system that performs a broad range of medical risk predictions, covering over 387 risk scenarios across diverse complex diseases, e.g., cardiovascular disease and cancer. RiskAgent is designed to collaborate with hundreds of clinical decision tools, i.e., risk calculators and scoring systems that are supported by evidence-based medicine. To evaluate our method, we have built MedRisk, the first benchmark specialized for risk prediction, comprising 12,352 questions spanning 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems. The results show that RiskAgent, with 8 billion model parameters, achieves 76.33% accuracy, outperforming the most recent commercial LLMs (o1, o3-mini, and GPT-4.5) and nearly doubling the 38.39% accuracy of GPT-4o. On rare diseases, e.g., Idiopathic Pulmonary Fibrosis (IPF), RiskAgent outperforms o1 and GPT-4.5 by 27.27% and 45.46% in accuracy, respectively. Finally, we conduct a generalization evaluation on an external evidence-based diagnosis benchmark and show that RiskAgent achieves the best results. These encouraging results demonstrate the potential of our solution for diverse diagnostic domains. To improve the adaptability of our model to different scenarios, we have built and open-sourced a family of models ranging from 1 billion to 70 billion parameters. Our code, data, and models are all available at https://github.com/AI-in-Health/RiskAgent.