Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages such as context protection, parallel processing, and distributed decision-making. However, empirical support for this claim rests primarily on comparisons with SAS baselines on benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS, which are designed to generalize better than manually designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS, featuring explicit task decomposition, context separation, and parallelization potential. We show that expert-architected MAS consistently outperform automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat, prioritizing superficial complexity that does not translate into functional utility and exposing a fundamental misalignment with multi-agent principles.