AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information (for example, by deciding which tests to perform) and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes over-rely on static medical question-answering benchmarks and fall short of capturing the interactive decision-making required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate the ability of LLMs to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of doctor agents, as well as reduced compliance, confidence, and willingness to attend follow-up consultations in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel on benchmarks like MedQA perform poorly on AgentClinic-MedQA. We also find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. Finally, we show that both too few and too many interactions reduce diagnostic accuracy in doctor agents. The code and data for this work are publicly available at https://AgentClinic.github.io.
