Multi-Modal Large Language Models (MLLMs), despite their success, exhibit limited generality and often fall short of specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To bridge this gap, this paper introduces the first agent explicitly designed for the medical field, named \textbf{M}ulti-modal \textbf{Med}ical \textbf{Agent} (MMedAgent). We curate an instruction-tuning dataset comprising six medical tools that solve seven tasks, enabling the agent to choose the most suitable tool for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model GPT-4o. Furthermore, MMedAgent can efficiently update and integrate new medical tools.