LLM-based multi-agent systems decompose complex tasks into interacting roles, but most are still orchestrated manually through prompts, tools, and control rules, and their agents are rarely optimized through a unified reinforcement learning (RL) interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow, rather than a single response or policy trajectory, as the unit of optimization. It represents a workflow through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent-to-model mappings. This representation decouples logical agents from physical model parameters, so roles can fully share, partially share, or fully separate their parameters, and rewards can be assigned at the role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime: a central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards, while model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL optimization improves on the manually specified workflows, with especially large gains for smaller models and on strict all-passed code metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
Title: UnityMAS-O: A General Reinforcement Learning Optimization Framework for LLM-Based Multi-Agent Systems
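The decoupling of logical agent roles from physical models described in the abstract can be illustrated with a small sketch. This is not UnityMAS-O's API: the names `Step`, `route_rewards`, the role names, and the model identifiers are all hypothetical, and summing the role-, turn-, and trajectory-level rewards into one scalar is an assumption made only to keep the example short. The sketch shows how a single role-to-model dictionary can express full sharing, partial sharing, and full separation, and how recorded steps might be grouped by the model that should receive them for an update.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded turn in a workflow trajectory (hypothetical structure)."""
    role: str                 # logical agent role that produced this turn
    turn: int                 # position of the turn in the workflow
    turn_reward: float = 0.0  # turn-level reward, if the user defines one


# Three parameter-sharing modes, expressed as role -> model mappings.
# Role and model names are illustrative assumptions.
FULL_SHARING = {"planner": "model_a", "searcher": "model_a", "answerer": "model_a"}
PARTIAL_SHARING = {"planner": "model_a", "searcher": "model_a", "answerer": "model_b"}
FULL_SEPARATION = {"planner": "model_a", "searcher": "model_b", "answerer": "model_c"}


def route_rewards(steps, role_to_model, role_rewards, trajectory_reward):
    """Group steps by the physical model backing each role.

    Assumption: the three reward levels are simply added per step.
    A real system could weight or propagate them differently.
    """
    per_model = defaultdict(list)
    for step in steps:
        total = (
            step.turn_reward
            + role_rewards.get(step.role, 0.0)
            + trajectory_reward
        )
        per_model[role_to_model[step.role]].append((step.role, step.turn, total))
    return dict(per_model)


if __name__ == "__main__":
    trajectory = [
        Step("planner", 0, turn_reward=0.25),
        Step("searcher", 1),
        Step("answerer", 2, turn_reward=0.5),
    ]
    batches = route_rewards(
        trajectory,
        PARTIAL_SHARING,
        role_rewards={"searcher": 0.5},
        trajectory_reward=1.0,
    )
    for model, items in batches.items():
        print(model, items)
```

Switching the sharing mode changes only the mapping passed to `route_rewards`; the workflow and reward definitions stay the same, which is the property the abstract attributes to agent-to-model mappings.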