MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.
