Large language model (LLM)-based AI systems have shown promise for patient-facing diagnostic and management conversations in simulated settings. Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight. We report a prospective, single-arm feasibility study of an LLM-based conversational AI, the Articulate Medical Intelligence Explorer (AMIE), conducting clinical history taking and presenting potential diagnoses for patients to discuss with their provider at urgent care appointments at a leading academic medical center. One hundred adult patients completed an AMIE text-chat interaction up to 5 days before their appointment. We sought to assess conversational safety and quality, patient and clinician experience, and AMIE's clinical reasoning capabilities compared with those of primary care providers (PCPs). Human safety supervisors monitored all patient-AMIE interactions in real time and did not need to intervene to stop any consultation based on pre-defined criteria. Patients reported high satisfaction, and their attitudes towards AI improved after interacting with AMIE (p < 0.001). PCPs found AMIE's output useful, with a positive impact on their preparedness. AMIE's differential diagnosis (DDx) included the final diagnosis, per chart review 8 weeks post-encounter, in 90% of cases, with 75% top-3 accuracy. Blinded assessment of AMIE and PCP DDx and management (Mx) plans suggested similar overall DDx and Mx plan quality, with no significant differences in DDx (p = 0.6) or in the appropriateness and safety of Mx (p = 0.1 and p = 1.0, respectively). PCPs outperformed AMIE in the practicality (p = 0.003) and cost-effectiveness (p = 0.004) of Mx. While further research is needed, this study demonstrates the initial feasibility, safety, and user acceptance of conversational AI in a real-world setting, representing crucial steps towards clinical translation.