Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi

Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must plan effectively, perform multi-step reasoning and actions, respond to novel observations, and recover from errors in order to complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture in which a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic-One achieves performance that is statistically competitive with the state of the art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic-One achieves these results without modification to core agent capabilities or to how the agents collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic-One's modular design allows agents to be added to or removed from the team without additional prompt tuning or training, easing development and making the system extensible to future scenarios. We provide an open-source implementation of Magentic-One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner, which is important when agents' actions have side effects. Magentic-One, AutoGenBench, and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis, are available at https://aka.ms/magentic-one
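
The plan, track, and re-plan loop described in the abstract can be illustrated with a small, self-contained sketch. This is not the Magentic-One implementation or its API: the `Orchestrator` class, the `ledger` list, the `planner` callback, and the stub `web_surfer` and `coder` agents are all hypothetical names chosen for illustration, and the static planner stands in for what would be a model-driven planning step. The sketch assumes each agent returns a success flag and an observation, and that any failed step triggers a re-plan up to a fixed budget.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical; this is not the Magentic-One API.
Step = Tuple[str, str]                            # (agent name, instruction)
Agent = Callable[[str], Tuple[bool, str]]         # instruction -> (ok, observation)
Planner = Callable[[str, List[str]], List[Step]]  # (task, ledger) -> plan


@dataclass
class Orchestrator:
    agents: Dict[str, Agent]
    planner: Planner
    max_replans: int = 2
    ledger: List[str] = field(default_factory=list)  # progress and error notes
    replans: int = 0

    def run(self, task: str) -> bool:
        plan = self.planner(task, self.ledger)
        while True:
            failed = False
            for agent_name, instruction in plan:
                # Direct the named specialist agent and record what happened.
                ok, observation = self.agents[agent_name](instruction)
                status = "ok" if ok else "error"
                self.ledger.append(f"{agent_name}: {status}: {observation}")
                if not ok:
                    failed = True
                    break
            if not failed:
                return True   # every step in the plan succeeded
            if self.replans >= self.max_replans:
                return False  # re-planning budget exhausted
            self.replans += 1
            plan = self.planner(task, self.ledger)  # re-plan using the ledger


def make_flaky_web_surfer() -> Agent:
    """Stub agent that fails on its first call and succeeds afterwards."""
    calls = {"n": 0}

    def web_surfer(instruction: str) -> Tuple[bool, str]:
        calls["n"] += 1
        if calls["n"] == 1:
            return False, "page timed out"
        return True, f"fetched result for '{instruction}'"

    return web_surfer


def coder(instruction: str) -> Tuple[bool, str]:
    """Stub agent that always succeeds."""
    return True, f"ran code for '{instruction}'"


def fixed_planner(task: str, ledger: List[str]) -> List[Step]:
    # A real planner would consult a model and the ledger; this one is static.
    return [("web_surfer", f"look up: {task}"), ("coder", f"summarize: {task}")]


orchestrator = Orchestrator(
    agents={"web_surfer": make_flaky_web_surfer(), "coder": coder},
    planner=fixed_planner,
)
succeeded = orchestrator.run("population of Iceland")
print(succeeded, orchestrator.replans)
for entry in orchestrator.ledger:
    print(entry)
```

In this toy run the first browsing step fails, the orchestrator records the error and re-plans once, and the second pass completes both steps. Because agents are looked up by name in a dictionary, adding or removing one does not change the loop itself, which loosely mirrors the modularity claim in the abstract.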
