Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model delivers care dynamically across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation comprising 1,638 SP cases with 24,602 trajectory-level, peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulated evaluation run, a clinical agent interacts in a closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against the expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to these educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
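
The closed-loop run and rubric scoring described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the MedSP1000 implementation: the names `RubricItem`, `run_encounter`, `completion_rate`, and `make_scripted_agent`, the `order:` prefix used to route actions to the environment controller, and the keyword-based rubric checks are all hypothetical. The agents are toy callables standing in for LLM-backed components.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str
    is_met: Callable[[List[str]], bool]  # hypothetical check over the agent's actions


def run_encounter(clinical_agent, patient_agent, environment, max_turns=10):
    """Run one closed-loop encounter and return the clinical agent's actions."""
    actions = []
    observation = "Patient: Hello, doctor."
    for _ in range(max_turns):
        action = clinical_agent(observation)
        actions.append(action)
        if action == "END":
            break
        # Assumed routing: orders go to the environment controller,
        # everything else goes to the patient agent.
        if action.startswith("order:"):
            observation = environment(action)
        else:
            observation = patient_agent(action)
    return actions


def completion_rate(actions, rubric):
    """Fraction of rubric items satisfied by the recorded actions."""
    if not rubric:
        return 0.0
    return sum(item.is_met(actions) for item in rubric) / len(rubric)


def make_scripted_agent(script):
    """Toy clinical agent that replays a fixed list of actions, then ends."""
    steps = iter(script)
    return lambda observation: next(steps, "END")


# Toy rubric with keyword checks; real rubric items would need expert judgment.
rubric = [
    RubricItem("Asks about pain history", lambda a: any("pain" in x for x in a)),
    RubricItem("Orders an ECG", lambda a: "order: ECG" in a),
    RubricItem("Prescribes aspirin", lambda a: any("aspirin" in x for x in a)),
]
agent = make_scripted_agent(["ask: when did the pain start?", "order: ECG"])
actions = run_encounter(
    agent,
    patient_agent=lambda question: "Patient: It started an hour ago.",
    environment=lambda order: "Result: ST elevation.",
)
score = completion_rate(actions, rubric)
print(f"{score:.1%} of rubric items completed")
```

In this toy run the scripted agent satisfies two of the three items, so the completion rate is about 66.7%. The sketch scores only the final list of actions; scoring behaviour at each stage of the encounter, as the abstract describes, would require rubric checks that also see the patient and environment responses.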
