The evolution of Large Language Models (LLMs) from passive text processors to autonomous agents has established planning as a core component of modern intelligence. However, achieving generalized planning remains elusive, hindered not only by the scarcity of high-quality interaction data but also by inherent conflicts across heterogeneous planning tasks. These challenges yield models that excel at isolated tasks yet struggle to generalize, while existing multi-task training attempts suffer from gradient interference. In this paper, we present \textbf{MagicAgent}, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks, including hierarchical task decomposition, tool-augmented planning, multi-constraint scheduling, procedural logic orchestration, and long-horizon tool execution. To mitigate training conflicts, we propose a two-stage training paradigm comprising supervised fine-tuning followed by multi-objective reinforcement learning over both static datasets and dynamic environments. Empirical results demonstrate that MagicAgent-32B and MagicAgent-30B-A3B deliver superior performance, achieving accuracies of $75.1\%$ on Worfbench, $55.9\%$ on NaturalPlan, $57.5\%$ on $\tau^2$-Bench, $86.9\%$ on BFCL-v3, and $81.2\%$ on ACEBench, as well as strong results on our in-house MagicEval benchmarks. These results substantially outperform existing sub-100B models and even surpass leading closed-source models.