Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units, ranging from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The scale gap that emerges separates publicly reported deployment envelopes from open academic evaluation regimes; it is not an independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.
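To make the notion of an orchestration trace concrete, the following is a minimal Python sketch of one possible in-memory representation with a JSON round trip. It is not the schema released in the artifact: the names `EventKind`, `TraceEvent`, `to_json`, `from_json`, `spawn_tree`, and `is_replayable`, the field names, and the example agents are all hypothetical and chosen for illustration. Only the seven event kinds are taken from the abstract.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Dict, List, Optional


class EventKind(str, Enum):
    # The seven event kinds named in the abstract.
    SPAWN = "spawn"
    DELEGATE = "delegate"
    COMMUNICATE = "communicate"
    TOOL_USE = "tool_use"
    RETURN = "return"
    AGGREGATE = "aggregate"
    STOP = "stop"


@dataclass(frozen=True)
class TraceEvent:
    # Hypothetical fields; the released schema may differ.
    step: int                     # position in the temporal order
    kind: EventKind
    agent: str                    # agent that emits the event
    target: Optional[str] = None  # receiving agent, if any
    payload: str = ""             # free-form content (task, message, tool name)


def to_json(events: List[TraceEvent]) -> str:
    # Store the enum by value so the JSON stays plain strings.
    return json.dumps([{**asdict(e), "kind": e.kind.value} for e in events])


def from_json(text: str) -> List[TraceEvent]:
    return [TraceEvent(**{**d, "kind": EventKind(d["kind"])})
            for d in json.loads(text)]


def spawn_tree(events: List[TraceEvent]) -> Dict[str, List[str]]:
    # Map each spawning agent to the sub-agents it created, in order.
    tree: Dict[str, List[str]] = {}
    for e in events:
        if e.kind is EventKind.SPAWN and e.target is not None:
            tree.setdefault(e.agent, []).append(e.target)
    return tree


def is_replayable(events: List[TraceEvent]) -> bool:
    # Assumed consistency check: steps strictly increase, and every agent
    # other than the root (the first event's agent) is spawned before it acts.
    if not events:
        return True
    live = {events[0].agent}
    last_step = events[0].step - 1
    for e in events:
        if e.step <= last_step or e.agent not in live:
            return False
        if e.kind is EventKind.SPAWN and e.target is not None:
            live.add(e.target)
        last_step = e.step
    return True


# A toy trace: an orchestrator spawns two workers, delegates, aggregates, stops.
events = [
    TraceEvent(0, EventKind.SPAWN, "orchestrator", "worker_a"),
    TraceEvent(1, EventKind.SPAWN, "orchestrator", "worker_b"),
    TraceEvent(2, EventKind.DELEGATE, "orchestrator", "worker_a", "search docs"),
    TraceEvent(3, EventKind.TOOL_USE, "worker_a", payload="web_search"),
    TraceEvent(4, EventKind.RETURN, "worker_a", "orchestrator", "3 hits"),
    TraceEvent(5, EventKind.AGGREGATE, "orchestrator"),
    TraceEvent(6, EventKind.STOP, "orchestrator"),
]
```

In this sketch the spawn events induce the agent tree, while the remaining events are edges or node-local actions ordered by `step`. The `STOP` event is recorded explicitly so that a stopping decision could in principle carry its own reward or credit signal.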
