AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

From arXiv; 49 pages, 12 figures, 11 tables

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.
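As a reading aid, the eight score ranges reported in the abstract can be tabulated and ordered. The sketch below is illustrative only: the ranges are copied from the abstract, while the range midpoint used for ordering and the names `REPORTED_RANGES`, `midpoint`, and `rank_by_midpoint` are assumptions introduced here, not statistics or code from the paper.

```python
# Ranges of mean scores (out of 5) per dimension, copied from the abstract.
REPORTED_RANGES = {
    "QS": (4.43, 4.99),  # medical interview questioning skills
    "ET": (4.38, 4.93),  # ethical and professional conduct
    "EX": (3.80, 4.72),  # clarity and transparency of clinical explanations
    "II": (3.19, 4.21),  # information integration
    "MS": (3.13, 3.78),  # medication safety and justification
    "HR": (2.57, 3.32),  # handling of ambiguous patient responses
    "IC": (2.08, 3.02),  # information coverage
    "Dx": (2.63, 3.55),  # diagnostic accuracy and reasoning
}


def midpoint(score_range):
    """Midpoint of a (low, high) range; an assumed summary, not a paper metric."""
    low, high = score_range
    return (low + high) / 2


def rank_by_midpoint(ranges):
    """Return dimension codes ordered from highest to lowest range midpoint."""
    return sorted(ranges, key=lambda dim: midpoint(ranges[dim]), reverse=True)


if __name__ == "__main__":
    for dim in rank_by_midpoint(REPORTED_RANGES):
        low, high = REPORTED_RANGES[dim]
        print(f"{dim}: {low:.2f}-{high:.2f} (midpoint {midpoint((low, high)):.2f})")
```

Ordered this way, the dimensions fall into the same three groups the abstract describes: QS, ET, and EX at the top, II and MS in the middle, and Dx, HR, and IC at the bottom.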

