Although large language models (LLMs) have made remarkable strides across a wide range of tasks, they still face substantial challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B achieve Final Pass Rates of only 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high inference costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. Through a simple end-to-end training pipeline, IMAGINE distills the structured reasoning and planning capabilities of a well-organized MAS into a single, compact model that not only acquires those capabilities but also significantly surpasses the MAS itself. Experimental results demonstrate that, when Qwen3-8B-Instruct is used as the base model and trained with our method, it achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B while maintaining a much smaller model size.