"Crash Test Dummies" for AI-Enabled Clinical Assessment: Validating Virtual Patient Scenarios with Virtual Learners

Brian Gin, Ahreum Lim, Flávia Silva e Oliveira, Kuan Xing, Xiaomei Song, Gayana Amiyangoda, Thilanka Seneviratne, Alison F. Doubleday, Ananya Gangopadhyaya, Bob Kiser, Lukas Shum-Tim, Dhruva Patel, Kosala Marambe, Lauren Maggio, Ara Tekian, Yoon Soo Park

Background: In medical and health professions education (HPE), AI is increasingly used to assess clinical competencies, including through virtual standardized patients. However, most evaluations of these tools rely on AI-human interrater reliability and lack a measurement framework for how cases, learners, and raters jointly shape scores. This leaves the robustness of the scores uncertain and can expose learners to misguidance from unvalidated systems. We address this gap by using AI "simulated learners" to stress-test and psychometrically characterize assessment pipelines before human use.

Objective: To develop an open-source AI virtual patient platform and a measurement model for robust competency evaluation across cases and rating conditions.

Methods: We built a platform comprising virtual patients, virtual learners with tunable competency profiles aligned to the Accreditation Council for Graduate Medical Education (ACGME) domains, and multiple independent AI raters that score encounters using structured Key-Features items. Encounter transcripts were analyzed with a Bayesian hierarchical rater model based on signal detection theory (HRM-SDT), which treats ratings as decisions under uncertainty and separates learner ability, case performance, and rater behavior. Parameters were estimated with Markov chain Monte Carlo (MCMC).

Results: The model recovered the simulated learners' competencies: estimates correlated significantly with the generating competencies across all ACGME domains, despite a non-deterministic pipeline. It estimated case difficulty by competency and showed stable rater detection (sensitivity) and criteria (severity/leniency thresholds) across AI raters that used identical models and prompts but different seeds. We also propose a staged "safety blueprint" for deploying AI tools with learners, tied to entrustment-based validation milestones.

Conclusions: Combining a purpose-built virtual patient platform with a principled psychometric model enables robust, interpretable, and generalizable competency estimates and supports validation of AI-assisted assessment before use with human learners.
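To make the rater component in the Methods more concrete, the following is a minimal sketch of a signal-detection rater model in its common textbook form. It is not the authors' implementation. It assumes a logistic link, a latent "true" item category `eta`, a rater detection parameter `d` (sensitivity), and ordered criteria (severity/leniency thresholds). The function names `logistic` and `rating_probs` are hypothetical and chosen for illustration only.

```python
import math


def logistic(x):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-x))


def rating_probs(eta, d, criteria):
    """Probability of each observed rating given a latent category.

    Assumed form (illustrative): P(Y <= k | eta) = logistic(c_k - d * eta),
    where `criteria` is an increasing list of thresholds c_k.
    Returns a list of len(criteria) + 1 probabilities, one per rating.
    """
    # Cumulative probabilities for each threshold, closed with 1.0.
    cumulative = [logistic(c - d * eta) for c in criteria] + [1.0]
    probs = []
    previous = 0.0
    for p in cumulative:
        probs.append(p - previous)
        previous = p
    return probs


# A rater with detection d = 4 and a single criterion at d / 2 (unbiased)
# scoring a binary item whose latent category is 1 ("feature present").
sharp = rating_probs(eta=1, d=4.0, criteria=[2.0])

# A rater with d = 0 cannot distinguish latent categories at all.
blind_present = rating_probs(eta=1, d=0.0, criteria=[0.0])
blind_absent = rating_probs(eta=0, d=0.0, criteria=[0.0])
```

In this sketch, a larger `d` means the rater's scores track the latent category more closely, while shifting the criteria up or down makes the rater more severe or more lenient without changing sensitivity. A full hierarchical model would additionally link the latent categories to learner ability and case difficulty and estimate all parameters jointly, for example with MCMC as the abstract describes.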
