AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information, for example by choosing which tests to perform, and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes over-rely on static medical question-answering benchmarks, which fall short of capturing the interactive decision-making required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open medical agent benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of doctor agents, as well as reduced compliance, confidence, and willingness to attend follow-up consultations in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel on benchmarks like MedQA perform poorly on AgentClinic-MedQA. We also find that the LLM used in the patient agent is an important factor for performance on the AgentClinic benchmark. Finally, we show that both too few and too many interactions reduce diagnostic accuracy in doctor agents. The code and data for this work are publicly available at https://AgentClinic.github.io.
