Practical deployment of multi-agent systems (MAS) demands strong performance at test time, motivating methods that guide search during inference and selectively spend compute to improve quality. We present the Multi-Agent System Process Reward Model (MASPRM). It assigns a per-action, per-agent value to partial inter-agent transcripts and acts as an inference-time controller. MASPRM is trained from multi-agent Monte Carlo Tree Search (MCTS) rollouts labeled only with terminal outcome rewards, without human step-level annotations, by propagating returns into local targets. At inference time, MASPRM guides step-level beam search (SBS) and MCTS, focusing computation on promising branches and pruning unpromising ones. We train and test MASPRM across tasks and domains, using GSM8K, MATH, MMLU, and LogiQA as benchmarks. Averaged across these benchmarks, MASPRM improves Hit@1 over policy likelihood by up to $+13.4$ points and improves ranking quality, reducing the Hit@1 $\rightarrow$ Hit@5 gap by up to $10.3$ points. MASPRM complements inference-time search by scoring intermediate routed transcripts to guide rollouts in MAS with fixed schedules. Code: https://github.com/milad1378yz/MASPRM
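To make the controller role concrete, the following is a minimal sketch of PRM-guided step-level beam search: at each step, candidate continuations of every partial transcript are scored by a process reward model, and only the top-valued transcripts are kept. All names here (`expand_step`, `prm_score`, `is_terminal`) are illustrative placeholders, not functions from the MASPRM codebase.

```python
# Hypothetical sketch of PRM-guided step-level beam search (SBS).
# `expand_step`, `prm_score`, and `is_terminal` are illustrative
# stand-ins for the policy's step proposer, the trained PRM, and a
# transcript-completion check; they are not MASPRM's actual API.
from typing import Callable, List, Tuple


def step_level_beam_search(
    init: str,
    expand_step: Callable[[str], List[str]],  # proposes next agent steps
    prm_score: Callable[[str], float],        # PRM value of a partial transcript
    is_terminal: Callable[[str], bool],       # is the transcript complete?
    beam_width: int = 4,
    max_steps: int = 8,
) -> str:
    """Keep the top-`beam_width` partial transcripts by PRM value at each step."""
    beam: List[Tuple[float, str]] = [(prm_score(init), init)]
    for _ in range(max_steps):
        candidates: List[Tuple[float, str]] = []
        for _, transcript in beam:
            if is_terminal(transcript):
                # Finished transcripts carry over unchanged.
                candidates.append((prm_score(transcript), transcript))
                continue
            for nxt in expand_step(transcript):
                candidates.append((prm_score(nxt), nxt))
        # Prune: keep only the highest-valued partial transcripts.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
        if all(is_terminal(t) for _, t in beam):
            break
    return max(beam, key=lambda c: c[0])[1]
```

As a toy usage, with transcripts as bit strings, a scorer that counts ones, and a three-step horizon, the search returns `"111"`:

```python
best = step_level_beam_search(
    "",
    expand_step=lambda t: [t + "0", t + "1"],
    prm_score=lambda t: t.count("1"),
    is_terminal=lambda t: len(t) >= 3,
    beam_width=2,
)
```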