Multi-Agent Debate~(MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. In this work, we disentangle MAD into two key components--Majority Voting and inter-agent Debate--and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that debate induces a martingale over agents' belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions that bias the belief update toward correction can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released at https://github.com/deeplearning-wisc/debate-or-vote.
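To make the martingale claim concrete, the following is a minimal sketch of the argument; the symbols used here (an agent's belief $p_t$ of holding the correct answer after debate round $t$, and the debate-history filtration $\mathcal{F}_t$) are illustrative notation, not necessarily the paper's exact formulation. If the belief process is a martingale, the tower property of conditional expectation gives
\[
  \mathbb{E}\!\left[p_{t+1} \mid \mathcal{F}_t\right] = p_t
  \quad\Longrightarrow\quad
  \mathbb{E}[p_T] = \mathbb{E}[p_0] \quad \text{for all rounds } T \ge 0,
\]
so, in expectation, additional rounds of debate do not raise correctness beyond the agents' initial beliefs, which is consistent with the observed gains coming from aggregation (Majority Voting) and from interventions that bias the belief update toward correction.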