Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
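The overlap analysis mentioned above can be illustrated with a minimal sketch. It assumes each model returns a ranked list of candidate diagnoses per case and that a case counts as solved when the gold diagnosis appears in the top k. The function names, the model labels (`model_a`, `model_b`, `model_c`), and the toy cases are hypothetical placeholders, not the paper's evaluation code or data.

```python
def solved_cases(predictions, gold, k=10):
    """Return ids of cases whose gold diagnosis is in the model's top-k list."""
    return {cid for cid, ranked in predictions.items() if gold[cid] in ranked[:k]}


def overlap_report(solved_by_model):
    """Summarize which cases each model solves alone and what the team covers."""
    union = set().union(*solved_by_model.values())
    report = {"union": union, "per_model": {}}
    for name, solved in solved_by_model.items():
        # Cases solved by any other model in the team.
        others = set().union(
            *(s for n, s in solved_by_model.items() if n != name)
        )
        report["per_model"][name] = {
            "solved": solved,
            "unique": solved - others,  # solved by this model only
        }
    return report


# Toy data (hypothetical): four cases, three models with different blind spots.
gold = {"c1": "dx1", "c2": "dx2", "c3": "dx3", "c4": "dx4"}
preds = {
    "model_a": {"c1": ["dx1"], "c2": ["dx9"], "c3": ["dx3"], "c4": ["dx8"]},
    "model_b": {"c1": ["dx1"], "c2": ["dx2"], "c3": ["dx7"], "c4": ["dx8"]},
    "model_c": {"c1": ["dx5"], "c2": ["dx2"], "c3": ["dx7"], "c4": ["dx8"]},
}

solved = {name: solved_cases(p, gold, k=1) for name, p in preds.items()}
report = overlap_report(solved)

for name, stats in report["per_model"].items():
    print(name, "solved:", sorted(stats["solved"]), "unique:", sorted(stats["unique"]))
print("team union:", sorted(report["union"]))
```

In this toy setup the union of solved cases is larger than any single model's set, which is the kind of complementarity the abstract attributes to mixed-vendor teams. A homogeneous team would correspond to models whose solved sets largely coincide, so the union adds little.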