While multiagent systems have shown promise for tackling complex tasks via specialization, finetuning multiple agents simultaneously faces two key challenges: (1) credit assignment across agents, and (2) the sample efficiency of expensive multiagent rollouts. In this work, we propose finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA) to address both. By assigning credit to individual agent actions rather than only at task completion, MAPPA enables fine-grained supervision without ground-truth labels while extracting maximal training signal from each rollout. We demonstrate our approach on competition math problems and tool-augmented data analysis tasks. On unseen math problems, MAPPA achieves gains of +5.0--17.5pp on AIME and +7.8--17.2pp on AMC. On data analysis tasks, our method improves success rate by +12.5pp while quality metrics improve by up to 30%, validating that per-action supervision yields improvements across different multiagent systems in various domains. By addressing these challenges, our work takes a first step toward scaling multiagent systems to complex, long-horizon tasks with minimal human supervision.
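The contrast between per-action process rewards and outcome-only rewards can be sketched as follows. This is a minimal illustration, not the paper's implementation: the judge function, data shapes, and names are all hypothetical stand-ins (in practice the judge would be an LLM scoring each action from AI feedback).

```python
# Hypothetical sketch of per-action credit assignment vs. an outcome-only
# baseline. All names and shapes here are illustrative assumptions.
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (agent_name, action_text)

def per_action_rewards(
    rollout: List[Action],
    judge: Callable[[str, str], float],  # AI feedback: scores one action
) -> List[Tuple[str, str, float]]:
    """Score every (agent, action) pair individually, so one rollout
    yields len(rollout) training signals instead of a single one."""
    return [(agent, act, judge(agent, act)) for agent, act in rollout]

def terminal_rewards(
    rollout: List[Action], success: bool
) -> List[Tuple[str, str, float]]:
    """Outcome-only baseline: every action in the rollout inherits the
    same task-level reward, regardless of which agent erred."""
    r = 1.0 if success else 0.0
    return [(agent, act, r) for agent, act in rollout]

# Toy judge standing in for an LLM-based scorer.
def toy_judge(agent: str, act: str) -> float:
    return 1.0 if "correct" in act else 0.0

rollout = [("solver", "correct derivation"), ("checker", "flawed verification")]
print(per_action_rewards(rollout, toy_judge))
print(terminal_rewards(rollout, success=False))
```

The per-action scheme distinguishes the solver's good step from the checker's bad one, while the terminal scheme gives both the same signal, which is the credit-assignment problem the abstract refers to.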