Large Language Models (LLMs) achieve results competitive with human experts on medical examinations. However, applying LLMs to complex clinical decision-making remains a challenge: it requires a deep understanding of medical knowledge and differs from the standardized, exam-style scenarios that dominate current efforts. A common approach is to fine-tune LLMs for target tasks, which not only demands substantial data and computational resources but also remains prone to generating `hallucinations'. In this work, we present RiskAgent, which synergizes language models with hundreds of validated clinical decision tools grounded in evidence-based medicine to provide generalizable and faithful recommendations. Our experiments show that RiskAgent not only achieves superior performance on a broad range of clinical risk prediction tasks across diverse scenarios and diseases, but also generalizes robustly: in tool learning on the external MedCalc-Bench dataset, and in medical reasoning and question answering on three representative benchmarks, MedQA, MedMCQA, and MMLU.