Large language models (LLMs) are increasingly considered for clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, which limits their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an electronic health record (EHR)-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), whereas persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented, pre-deployment evaluation of medical LLMs.