Multi-Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows at either the task level or the query level, but their relative costs and benefits remain unclear. Through rethinking and empirical analysis, we show that query-level workflow generation is not always necessary: a small set of top-K task-level workflows already covers an equivalent or even larger set of queries. We further find that exhaustive execution-based task-level evaluation is both extremely token-costly and frequently unreliable. Inspired by self-evolution and generative reward modeling, we propose \textbf{SCALE}, a low-cost task-level generation framework based on \underline{\textbf{S}}elf-prediction of the optimizer with few-shot \underline{\textbf{CAL}}ibration for \underline{\textbf{E}}valuation instead of full validation execution. Extensive experiments demonstrate that \textbf{SCALE} maintains competitive performance, with an average degradation of only 0.61\% compared to existing approaches across multiple datasets, while cutting overall token usage by up to 83\%.