Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps along a trajectory, has shown promise in general reasoning settings and has been suggested as a tool for guiding MAS coordination; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for MAS. Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated at two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context management strategies, conducting experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe only a small performance gap between LLMs acting as judges and as single agents, and identify a trade-off between context length and verification performance. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring advances beyond current paradigms. Code is available at https://github.com/Wang-ML-Lab/MAS-ProVe.