While multiagent systems have shown promise for tackling complex tasks through specialization, finetuning multiple agents simultaneously faces two key challenges: (1) credit assignment across agents, and (2) the sample efficiency of expensive multiagent rollouts. In this work, we propose finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA) to address both. By assigning credit to individual agent actions rather than only at task completion, MAPPA enables fine-grained supervision without ground-truth labels and extracts more training signal from each rollout. We demonstrate our approach on competition math problems and tool-augmented data analysis tasks. On unseen math problems, MAPPA improves accuracy by 5.0 to 17.5 percentage points (pp) on AIME and by 7.8 to 17.2pp on AMC. On data analysis tasks, our method improves success rate by 16.7pp, while quality metrics improve by up to 47%. These results indicate that per-action supervision can yield improvements across different multiagent systems and domains. By addressing both challenges, our work takes a first step toward scaling multiagent systems for complex, long-horizon tasks with minimal human supervision.
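To make the credit-assignment contrast concrete, the following is a minimal illustrative sketch, not the MAPPA algorithm itself. It assumes a rollout is a list of agent actions and that an AI judge has already attached a score in [0, 1] to each one. The agent names, the `Step` type, the scores, and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    agent: str    # which agent acted (hypothetical role name)
    action: str   # text of the action taken
    score: float  # hypothetical per-action score from an AI judge, in [0, 1]


def outcome_only_rewards(steps: List[Step], task_succeeded: bool) -> List[float]:
    """Baseline: every action inherits the same end-of-task reward."""
    final = 1.0 if task_succeeded else 0.0
    return [final for _ in steps]


def per_action_rewards(steps: List[Step]) -> List[float]:
    """Per-action supervision: each action keeps its own judge score."""
    return [s.score for s in steps]


def mean_reward_by_agent(steps: List[Step], rewards: List[float]) -> Dict[str, float]:
    """Average the rewards credited to each agent's actions."""
    grouped: Dict[str, List[float]] = {}
    for step, reward in zip(steps, rewards):
        grouped.setdefault(step.agent, []).append(reward)
    return {agent: sum(rs) / len(rs) for agent, rs in grouped.items()}


# A failed rollout in which only one agent's action was actually poor.
rollout = [
    Step("solver", "set up the equation", 0.75),
    Step("coder", "ran a script with an off-by-one bug", 0.25),
    Step("verifier", "flagged the wrong answer", 1.0),
]

outcome = mean_reward_by_agent(rollout, outcome_only_rewards(rollout, task_succeeded=False))
process = mean_reward_by_agent(rollout, per_action_rewards(rollout))
```

Under the outcome-only baseline, all three agents receive the same zero reward for the failed task, so the rollout carries no information about which action caused the failure. With per-action scores, the same rollout yields a distinct signal for each agent, which is the intuition behind both the credit-assignment and the sample-efficiency claims.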