Reward Modeling for Multi-Agent Orchestration

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
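The abstract mentions training a Bradley-Terry reward model on win-lose pairs and using it for test-time scaling. As a minimal sketch of what that objective generally looks like (not the authors' implementation), the snippet below computes the standard Bradley-Terry pairwise loss over scalar reward scores and uses a scoring function for best-of-N selection. The names `bt_win_probability`, `bt_pairwise_loss`, and `best_of_n` are hypothetical, and the scores are assumed to come from some reward model that rates a candidate orchestration.

```python
import math


def bt_win_probability(r_win: float, r_lose: float) -> float:
    """Bradley-Terry probability that the 'win' item beats the 'lose' item."""
    return 1.0 / (1.0 + math.exp(-(r_win - r_lose)))


def bt_pairwise_loss(pairs):
    """Mean negative log-likelihood over (r_win, r_lose) reward-score pairs.

    Each pair holds the scalar scores a (hypothetical) reward model assigns
    to the preferred and the dispreferred orchestration.
    """
    total = 0.0
    for r_win, r_lose in pairs:
        margin = r_win - r_lose
        # -log(sigmoid(margin)) == softplus(-margin), computed stably.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(pairs)


def best_of_n(candidates, score_fn):
    """Test-time scaling sketch: keep the candidate the scorer rates highest."""
    return max(candidates, key=score_fn)
```

The loss shrinks as the preferred orchestration's score exceeds the dispreferred one's by a larger margin, which is the property reward-guided training and best-of-N selection both rely on. How OrchRM actually builds its pairs from intermediate artifacts is not specified in the abstract and is not modeled here.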

