Deliberative multi-agent large language models improve clinical reasoning in ophthalmology

Ehsan Misaghi, Sean T Berkowitz, Bing Yu Chen, Qingyu Chen, Renaud Duval, Pearse A Keane, Danny A Mammo, Ariel Yuhan Ong, Mertcan Sevgi, Sumit Sharma, Sunil K Srivastava, Yih Chung Tham, Fares Antaki

Large language models (LLMs) show potential for ophthalmic clinical reasoning, yet individual models risk introducing harm. We evaluated whether multi-agent LLM deliberative councils improve diagnostic performance and mitigate harm compared to individual LLMs. In a comparative cross-sectional study, we assessed 12 individual LLMs and three multi-agent councils on 100 ophthalmology clinical vignettes. Each council comprised four models assembled by type: proprietary flagship, proprietary fast, and open-source. For each vignette, models answered independently and anonymously ranked one another's responses; a designated chair then synthesized all responses and peer reviews into a final answer. Councils consistently outperformed pooled individual models across all three tiers. Accuracy improved for proprietary flagship (95.0% vs 90.8%; risk difference [RD]: 4.25 [95% CI: 0.45, 8.05]), proprietary fast (96.0% vs 86.5%; RD: 9.50 [5.31, 13.59]), and open-source councils (91.0% vs 83.2%; RD: 7.75 [4.17, 11.33]). Harm rates declined for proprietary flagship (10.0% vs 22.5%; RD: -12.50 [-16.86, -8.14]), proprietary fast (16.0% vs 31.8%; RD: -15.75 [-21.49, -10.01]), and open-source councils (22.0% vs 38.5%; RD: -16.50 [-22.27, -10.73]). Coverage analysis revealed net positive gains for accuracy (ΔCoverage: 4.4-9.8 percentage points) and safety (ΔCoverage: 13.6-20.6), indicating that councils recovered correct diagnoses and averted harm. Councils elevated correct diagnoses to higher rank positions and produced more complete differentials and management plans (all P<.05). Harmful council responses showed reduced combined commission-and-omission errors and tended to be less severe. Structured deliberation via multi-agent LLM councils may enhance the reliability of LLM-assisted ophthalmic clinical reasoning.
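The three-stage council procedure described above (independent answers, anonymous peer ranking, chair synthesis) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Model` type, the `run_council` function, the prompt wording, and the "Response A" and "Reviewer 1" labels are all hypothetical, and the sketch assumes each member ranks only the other members' responses, which is one reading of "one another's responses".

```python
from typing import Callable, Dict

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
Model = Callable[[str], str]


def run_council(vignette: str, members: Dict[str, Model], chair: Model) -> str:
    """Illustrative three-stage deliberation; all prompts are assumptions."""
    # Stage 1: every member answers the vignette independently.
    answers = {name: model(vignette) for name, model in members.items()}

    # Anonymize: map each model name to a neutral label (Response A, B, ...).
    labels = {name: f"Response {chr(65 + i)}" for i, name in enumerate(answers)}

    # Stage 2: each member ranks the other members' anonymized responses.
    reviews = []
    for name, model in members.items():
        others = "\n".join(
            f"{labels[other]}: {text}"
            for other, text in answers.items()
            if other != name
        )
        reviews.append(
            model(f"Vignette:\n{vignette}\n\nRank these responses:\n{others}")
        )

    # Stage 3: the chair sees all anonymized answers and reviews,
    # then writes the final answer.
    all_answers = "\n".join(f"{labels[n]}: {a}" for n, a in answers.items())
    all_reviews = "\n".join(
        f"Reviewer {i + 1}: {r}" for i, r in enumerate(reviews)
    )
    return chair(
        f"Vignette:\n{vignette}\n\nResponses:\n{all_answers}\n\n"
        f"Peer reviews:\n{all_reviews}\n\nSynthesize a final answer."
    )
```

In this sketch the model names never reach any prompt, so neither the reviewers nor the chair can tell which model produced which response; whether the study's chair was similarly blinded is not stated in the abstract.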
