Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model delivers care dynamically across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation comprising 1,638 SP cases with 24,602 trajectory-level, peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulated evaluation run, a clinical agent interacts in a closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, and the strongest medically specialized model reaches only 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
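The closed-loop protocol described in the abstract (a clinical agent interacting with a patient agent and an environment controller, then scored against rubric items) can be illustrated with a minimal sketch. This is not the MedSP1000 harness: every name here (`RubricItem`, `run_encounter`, `rubric_completion`, the `ORDER ` and `END` action conventions, and the scripted toy agents) is a hypothetical stand-in, and the keyword-based rubric checks are a simplification chosen only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One expert criterion; `is_met` is a hypothetical transcript check."""
    description: str
    is_met: Callable[[List[str]], bool]


def run_encounter(clinical_agent, patient_agent, environment, max_turns=5):
    """Run one closed-loop encounter and return the full transcript."""
    transcript: List[str] = []
    observation = "Patient enters the room."
    for _ in range(max_turns):
        action = clinical_agent(observation)
        transcript.append(f"clinician: {action}")
        if action == "END":
            break
        # Assumed convention: orders go to the environment controller,
        # everything else is spoken to the patient agent.
        if action.startswith("ORDER "):
            observation = environment(action)
        else:
            observation = patient_agent(action)
        transcript.append(f"response: {observation}")
    return transcript


def rubric_completion(transcript, rubric):
    """Fraction of rubric items satisfied over the whole trajectory."""
    met = sum(1 for item in rubric if item.is_met(transcript))
    return met / len(rubric)


def scripted_clinician():
    """Toy clinical agent that replays a fixed list of actions."""
    steps = iter(["What brings you in today?", "ORDER ecg", "END"])
    return lambda observation: next(steps)


def toy_patient(utterance):
    return "I have had chest pain since this morning."


def toy_environment(order):
    return "ECG: normal sinus rhythm."


rubric = [
    RubricItem("Asks about the presenting complaint",
               lambda t: any("brings you in" in line for line in t)),
    RubricItem("Orders an ECG",
               lambda t: any("ORDER ecg" in line for line in t)),
    RubricItem("Asks about medication allergies",
               lambda t: any("allerg" in line.lower() for line in t)),
]

transcript = run_encounter(scripted_clinician(), toy_patient, toy_environment)
score = rubric_completion(transcript, rubric)
print(f"Rubric completion: {score:.1%}")
```

In this toy run the scripted clinician satisfies two of the three items, so the completion rate is two thirds. The point of the sketch is that the score depends on the whole trajectory rather than on a single final answer, which is the property the abstract contrasts with single-turn benchmarks.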