A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic

Peter Brodeur, Jacob M. Koshy, Anil Palepu, Khaled Saab, Ava Homiar, Roma Ruparel, Charles Wu, Ryutaro Tanno, Joseph Xu, Amy Wang, David Stutz, Hannah M. Ferrera, David Barrett, Lindsey Crowley, Jihyeon Lee, Spencer E. Rittner, Ellery Wulczyn, Selena K. Zhang, Elahe Vedadi, Christine G. Kohn, Kavita Kulkarni, Vinay Kadiyala, Sara Mahdavi, Wendy Du, Jessica Williams, David Feinbloom, Renee Wong, Tao Tu, Petar Sirkovic, Alessio Orlandi, Christopher Semturs, Yun Liu, Juraj Gottweis, Dale R. Webster, Joëlle Barral, Katherine Chou, Pushmeet Kohli, Avinatan Hassidim, Yossi Matias, James Manyika, Rob Fields, Jonathan X. Li, Marc L. Cohen, Vivek Natarajan, Mike Schaekermann, Alan Karthikesalingam, Adam Rodman

Large language model (LLM)-based AI systems have shown promise for patient-facing diagnostic and management conversations in simulated settings. Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight. We report a prospective, single-arm feasibility study of an LLM-based conversational AI, the Articulate Medical Intelligence Explorer (AMIE), conducting clinical history taking and presenting potential diagnoses for patients to discuss with their provider at urgent care appointments at a leading academic medical center. One hundred adult patients completed an AMIE text-chat interaction up to 5 days before their appointment. We sought to assess conversational safety and quality, patient and clinician experience, and AMIE's clinical reasoning capabilities relative to those of primary care providers (PCPs). Human safety supervisors monitored all patient-AMIE interactions in real time and did not need to intervene to stop any consultations based on pre-defined criteria. Patients reported high satisfaction, and their attitudes towards AI improved after interacting with AMIE (p < 0.001). PCPs found AMIE's output useful, with a positive impact on preparedness. AMIE's differential diagnosis (DDx) included the final diagnosis, per chart review 8 weeks post-encounter, in 90% of cases, with 75% top-3 accuracy. Blinded assessment of AMIE and PCP DDx and management (Mx) plans suggested similar overall DDx and Mx plan quality, without significant differences for DDx (p = 0.6) or for the appropriateness and safety of Mx (p = 0.1 and 1.0, respectively). PCPs outperformed AMIE in the practicality (p = 0.003) and cost-effectiveness (p = 0.004) of Mx. While further research is needed, this study demonstrates the initial feasibility, safety, and user acceptance of conversational AI in a real-world setting, representing crucial steps towards clinical translation.
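For readers unfamiliar with the DDx metrics quoted above, the following is a minimal Python sketch of how "final diagnosis included in the DDx" and "top-3 accuracy" can be computed from ranked diagnosis lists. It is illustrative only and not the study's evaluation code: the function name `top_k_accuracy`, the exact string matching, and the four toy cases are all assumptions made for this example. In the study, the final diagnosis was established by chart review, which involves clinical judgement rather than string comparison.

```python
from typing import Optional, Sequence


def top_k_accuracy(
    ranked_ddx: Sequence[Sequence[str]],
    final_dx: Sequence[str],
    k: Optional[int] = None,
) -> float:
    """Fraction of cases whose final diagnosis appears among the first k
    entries of the ranked DDx (the whole list when k is None).

    Hypothetical helper: exact string matching is a simplification."""
    if len(ranked_ddx) != len(final_dx):
        raise ValueError("one final diagnosis is required per DDx list")
    if not final_dx:
        raise ValueError("at least one case is required")
    hits = sum(
        truth in (ddx if k is None else ddx[:k])
        for ddx, truth in zip(ranked_ddx, final_dx)
    )
    return hits / len(final_dx)


# Toy data invented for illustration; these are not study cases.
toy_ddx = [
    ["viral pharyngitis", "strep pharyngitis", "mononucleosis"],
    ["tension headache", "migraine", "sinusitis", "cluster headache"],
    ["ankle sprain", "ankle fracture"],
    ["GERD", "gastritis"],
]
toy_final = [
    "strep pharyngitis",  # ranked 2nd: counts for top-3 and for inclusion
    "cluster headache",   # ranked 4th: counts for inclusion only
    "ankle sprain",       # ranked 1st: counts everywhere
    "cholelithiasis",     # absent from the DDx: a miss
]

inclusion = top_k_accuracy(toy_ddx, toy_final)     # any position in the list
top3 = top_k_accuracy(toy_ddx, toy_final, k=3)     # first three entries only
top1 = top_k_accuracy(toy_ddx, toy_final, k=1)     # leading diagnosis only
print(inclusion, top3, top1)
```

The gap between inclusion and top-3 accuracy in the toy data mirrors the distinction in the abstract: a diagnosis can appear in a long DDx without being ranked near the top.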
