Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional computation. ARMOR-MAD combines three components: Pre-debate Agreement Routing (PAR) decides whether independently generated Round-0 answers require debate; Early Agreement Stopping Evaluator (EASE) stops debate after convergence; and Semantic Outlier Detection (SOD) down-weights abnormal final answers during aggregation. Across MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD consistently improves over fixed-round heterogeneous debate with the same model pool, reaching 65.5\%, 96.5\%, 90.0\%, and 81.5\% accuracy, respectively. The results suggest that genuine model heterogeneity and agreement-based control are both important for making MAD more accurate and efficient.
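The control flow described in the abstract can be illustrated with a minimal sketch. It is not the ARMOR-MAD implementation: the `Agent` callable signature, the exact-match agreement score, the `threshold`, `max_rounds`, and `outlier_weight` parameters, and every function name below are assumptions made for illustration. In particular, the SOD stand-in treats an answer shared by no other agent as an outlier, whereas the abstract describes a semantic detector.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical agent interface: (question, peers' latest answers) -> answer.
Agent = Callable[[str, Sequence[str]], str]


def agreement(answers: Sequence[str]) -> float:
    """Fraction of answers equal to the most common answer (exact match)."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def weighted_vote(answers: Sequence[str], outlier_weight: float = 0.5) -> str:
    """SOD stand-in: answers that no other agent shares get reduced weight."""
    counts = Counter(answers)
    scores: Counter = Counter()
    for a in answers:
        scores[a] += outlier_weight if counts[a] == 1 else 1.0
    return max(scores, key=scores.get)


def conditional_debate(
    question: str,
    agents: List[Agent],
    threshold: float = 1.0,  # assumed agreement level that counts as consensus
    max_rounds: int = 3,     # assumed debate budget
) -> Tuple[str, int]:
    """Return (final answer, number of debate rounds actually run)."""
    # Round 0: every agent answers independently, without seeing peers.
    answers = [agent(question, []) for agent in agents]
    rounds_used = 0

    # PAR stand-in: debate only if Round-0 agreement is below the threshold.
    if agreement(answers) < threshold:
        for _ in range(max_rounds):
            # Each agent sees the previous round's answers from all agents.
            answers = [agent(question, answers) for agent in agents]
            rounds_used += 1
            # EASE stand-in: stop as soon as the agents have converged.
            if agreement(answers) >= threshold:
                break

    return weighted_vote(answers), rounds_used
```

With stub agents that already agree at Round 0, this sketch returns without running any debate round, which is the conditional-computation behavior the abstract attributes to PAR. When agents disagree, the loop runs only until the agreement check passes or the round budget is exhausted.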