Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable interactions that vary systematically along medical, linguistic, and behavioral dimensions, supporting risk assessment across populations. Methods: Grounded in the NIST AI Risk Management Framework (AI RMF), the simulator integrates three profile components: (1) medical profiles constructed from All of Us electronic health records using risk-ratio gating; (2) linguistic profiles modeling health literacy and condition-specific communication; and (3) behavioral profiles representing cooperative, distracted, and adversarial engagement. Profiles were evaluated against NIST AI RMF trustworthiness requirements and then tested against an AI Decision Aid for antidepressant selection. Results: Across 500 simulated conversations, the simulator revealed a monotonic degradation in AI Decision Aid performance as health literacy decreased: Rank-1 concept retrieval fell from 81.9% (proficient) to 47.6% (limited), with a corresponding decline in recommendation quality. Medical concept fidelity was high (96.6% across 8,210 concepts), as validated by human annotators (kappa = 0.73) and by an LLM judge with comparable agreement (kappa = 0.78). Behavioral profiles were reliably distinguished (kappa = 0.93), and linguistic profiles showed moderate agreement (kappa = 0.61). Conclusions: The simulator exposes measurable performance risks in conversational healthcare AI. Health literacy emerged as a primary risk factor, with direct implications for equitable AI deployment.
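The agreement figures in the Results are reported as kappa values. The abstract does not say which kappa variant was used, so the following is only a minimal sketch that assumes Cohen's kappa for two raters labeling the same items. The function name `cohens_kappa` and the example labels are illustrative and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical behavioral-profile labels from two annotators (not study data).
rater_a = ["cooperative", "distracted", "adversarial",
           "cooperative", "distracted", "adversarial"]
rater_b = ["cooperative", "distracted", "adversarial",
           "cooperative", "adversarial", "adversarial"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 5 of 6 items (observed agreement 0.833) while chance agreement is 0.333, giving kappa = 0.75. Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement when label frequencies are uneven.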