Scaling laws for inference compute in multi-agent systems remain under-explored compared to single-agent scenarios. This work bridges this gap by investigating data synthesis through multi-agent sampling, where synthetic responses are generated by sampling from multiple distinct language models. Effective model coordination is crucial for successful multi-agent collaboration. Unlike previous approaches that rely on fixed workflows, we treat model coordination as a multi-step decision-making process, optimizing the generation structure dynamically for each input question. We introduce Tree Search-based Orchestrated Agents~(TOA), in which the workflow evolves iteratively during the sequential sampling process. To achieve this, we leverage Monte Carlo Tree Search (MCTS), integrating a reward model that provides real-time feedback and accelerates exploration. Our experiments on alignment, machine translation, and mathematical reasoning demonstrate that multi-agent sampling significantly outperforms single-agent sampling as inference compute scales. TOA is the most compute-efficient approach, achieving state-of-the-art performance on WMT and a 71.8\% length-controlled (LC) win rate on AlpacaEval. Moreover, fine-tuning with our synthesized alignment data surpasses strong preference-learning methods on challenging benchmarks such as Arena-Hard and AlpacaEval.
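To make the orchestration idea concrete, below is a minimal, self-contained Python sketch of MCTS-style coordination over a pool of language models, with a reward model scoring each sampled response. Everything here (`MODELS`, `call_model`, `reward_model`, the `Node` class, the exploration constant) is a hypothetical stand-in introduced for illustration, not the paper's implementation; in TOA the stubs would be replaced by real LLM calls and a learned reward model.

```python
import math
import random

# Hypothetical agent pool and stub interfaces (stand-ins for real LLMs
# and a learned reward model; not the paper's actual implementation).
MODELS = ["model_a", "model_b", "model_c"]

def call_model(model, question, context):
    # Stand-in for querying an LLM, conditioned on prior turns.
    return f"{model} answer to {question!r} given {len(context)} prior turns"

def reward_model(question, response):
    # Stand-in for a reward model providing real-time feedback.
    return random.random()

class Node:
    def __init__(self, context, parent=None):
        self.context = context   # sequence of (model, response) steps so far
        self.parent = parent
        self.children = {}       # model name -> child Node
        self.visits = 0
        self.value = 0.0

    def uct(self, c=1.4):
        # Upper Confidence bound for Trees: exploit mean value, explore rarely-visited nodes.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(question, iterations=50, depth=3):
    root = Node(context=[])
    best = (None, -1.0)
    for _ in range(iterations):
        # 1) Selection: descend by UCT while nodes are fully expanded.
        node = root
        while len(node.children) == len(MODELS) and len(node.context) < depth:
            node = max(node.children.values(), key=Node.uct)
        # 2) Expansion: extend the workflow with an untried model.
        if len(node.context) < depth:
            model = next(m for m in MODELS if m not in node.children)
            response = call_model(model, question, node.context)
            child = Node(node.context + [(model, response)], parent=node)
            node.children[model] = child
            node = child
        # 3) Evaluation: score the newest response with the reward model.
        reward = reward_model(question, node.context[-1][1])
        if reward > best[1]:
            best = (node.context[-1][1], reward)
        # 4) Backpropagation: propagate the reward to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return best

if __name__ == "__main__":
    answer, score = mcts("What is 2 + 2?")
    print(f"best response (reward={score:.3f}): {answer}")
```

The point the sketch illustrates is the design choice in the abstract: the workflow (which model responds at each step) is not fixed in advance but grows with the search tree, and reward feedback steers exploration toward promising model sequences.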