Auditing medical multi-agent AI reveals risks of false consensus

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through specialist roles, peer review and consensus formation. In clinical decision support, however, apparent consensus is not enough. Clinicians also need to know whether agents checked the evidence, addressed disagreement and kept uncertainty visible. Current evaluations largely score final accuracy, leaving the safety of the collaborative process untested. Here we introduce MedAgentAudit, a clinically grounded workflow audit framework for diagnosing and quantifying collaborative failure modes in medical multi-agent systems. From 3,600 execution logs, we derive an expert-validated taxonomy of ten recurrent failures spanning task comprehension, collaborative discussion, and synthesis and decision-making. We then deploy an expert-validated automated auditor as a non-interventional probe across 14,400 cases, covering six multi-agent architectures, six medical text and vision datasets, and four large language model settings per modality. Across systems, collaboration yields uneven accuracy gains and frequent process failures. Unsupported observations affect 16.63% of cases and propagate downstream. In discussion, agents repeat initial views in 98.42% of cases rather than re-examining evidence, and fail to activate specialist reasoning in 42.73% of cases. During synthesis, final answers often substitute authority or majority count for evidence checking, showing authority bias in 28.76% of cases (rising from 35.30% to 68.75% across rounds), self-contradiction in 18.53%, contradiction neglect in 5.48% and minority suppression in 5.11%. MedAgentAudit reframes medical AI evaluation from output scoring to process-level safety and accountability, providing a practical foundation for transparent, auditable and clinician-supervised agentic systems in medicine.
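As an illustration of how per-case audit results could be turned into the kind of prevalence figures quoted above, the following minimal Python sketch tallies, for each failure mode, the share of cases in which it was flagged. It is not the MedAgentAudit implementation: the `AuditRecord` type, the `prevalence` function, the failure-mode label strings and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical labels for a few of the failure modes named in the abstract;
# the actual taxonomy has ten entries and may use different names.
FAILURE_MODES = (
    "unsupported_observation",
    "repeated_initial_view",
    "authority_bias",
    "self_contradiction",
    "contradiction_neglect",
    "minority_suppression",
)


@dataclass
class AuditRecord:
    """One audited case: an identifier plus the set of failure modes flagged."""
    case_id: str
    flags: frozenset = field(default_factory=frozenset)


def prevalence(records):
    """Return, per failure mode, the percentage of cases in which it was flagged."""
    if not records:
        return {mode: 0.0 for mode in FAILURE_MODES}
    # Flags are stored as a set, so each case counts at most once per mode.
    counts = Counter(flag for record in records for flag in record.flags)
    return {mode: 100.0 * counts[mode] / len(records) for mode in FAILURE_MODES}


# Toy records for illustration only; these are not data from the study.
toy_records = [
    AuditRecord("case-1", frozenset({"repeated_initial_view", "authority_bias"})),
    AuditRecord("case-2", frozenset({"repeated_initial_view"})),
    AuditRecord("case-3", frozenset({"repeated_initial_view", "unsupported_observation"})),
    AuditRecord("case-4", frozenset()),
]

rates = prevalence(toy_records)
for mode, pct in rates.items():
    print(f"{mode}: {pct:.2f}%")
```

Counting at the case level, rather than counting every flagged utterance, matches how the abstract reports its results, as the percentage of cases affected by each failure mode.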
