Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when they operate as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate. While effective, MAD introduces substantial computational overhead because of the number of agents involved and the frequent communication among them. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents independently provide decisions and comments, and a meta-reviewer integrates their feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compare MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
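The author, reviewer, and meta-reviewer roles described in the abstract can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the implementation in the linked repository: the names `Review`, `run_mars`, and `max_rounds`, the agent signatures, and the toy agents are all hypothetical, and the agents are plain Python functions standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    accept: bool   # the reviewer's decision
    comment: str   # the reviewer's free-text comment


# Hypothetical agent signatures; in practice each would wrap an LLM call.
Author = Callable[[str, str], str]        # (question, guidance) -> solution
Reviewer = Callable[[str, str], Review]   # (question, solution) -> review
# (question, solution, reviews) -> (final decision, guidance for revision)
MetaReviewer = Callable[[str, str, List[Review]], Tuple[bool, str]]


def run_mars(question: str, author: Author, reviewers: List[Reviewer],
             meta_reviewer: MetaReviewer, max_rounds: int = 3) -> str:
    """Run at most `max_rounds` review rounds and return the latest solution."""
    guidance = ""
    solution = author(question, guidance)
    for _ in range(max_rounds):
        # Each reviewer sees only the question and the solution, never
        # another reviewer's output, so reviewers do not talk to each other.
        reviews = [review(question, solution) for review in reviewers]
        accepted, guidance = meta_reviewer(question, solution, reviews)
        if accepted:
            break
        # The meta-reviewer's guidance drives the author's revision.
        solution = author(question, guidance)
    return solution


# Toy stand-ins for the three roles (assumptions for demonstration only).
def toy_author(question: str, guidance: str) -> str:
    # The first draft is deliberately wrong; any guidance triggers the fix.
    return "4" if guidance else "5"


def toy_reviewer(question: str, solution: str) -> Review:
    ok = solution == "4"
    return Review(accept=ok, comment="looks right" if ok else "recheck the sum")


def toy_meta_reviewer(question: str, solution: str,
                      reviews: List[Review]) -> Tuple[bool, str]:
    # Stub policy (an assumption): accept only if every reviewer accepts.
    accepted = all(r.accept for r in reviews)
    guidance = "" if accepted else "; ".join(r.comment for r in reviews)
    return accepted, guidance


answer = run_mars("What is 2 + 2?", toy_author,
                  [toy_reviewer, toy_reviewer], toy_meta_reviewer)
print(answer)  # prints 4
```

In this sketch, each round costs one call per reviewer, one meta-reviewer call, and at most one author revision. Reviewer outputs go only to the meta-reviewer, which is the property the abstract credits for keeping token consumption and inference time under control.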