Large Language Models (LLMs) have shown remarkable capabilities in general natural language processing tasks but often fall short in complex reasoning tasks. Recent studies have explored human-like problem-solving strategies, such as self-correction, to push the boundary of single-model reasoning ability further. In this work, we let a single model "step outside the box" by engaging multiple models to correct each other. We introduce a multi-agent collaboration strategy that emulates the academic peer review process. Each agent independently constructs its own solution, reviews the solutions of the other agents, and assigns confidence levels to its reviews. Upon receiving peer reviews, agents revise their initial solutions. Extensive experiments on three different types of reasoning tasks show that our collaboration approach delivers superior accuracy across all ten datasets compared to existing methods. Further study demonstrates the effectiveness of integrating confidence in the reviews for math reasoning and suggests a promising direction for human-mimicking multi-agent collaboration.
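The three stages described in the abstract (independent solutions, peer reviews with confidence, revision) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Agent` and `Review` classes, the `peer_review_round` and `revise` functions, and the fixed confidence values are illustrative only. Each agent is a deterministic stub standing in for an LLM call, a review is reduced to a suggested answer plus a confidence score, and revision is approximated by a confidence-weighted vote, whereas in the approach described above the agents themselves read the reviews and rewrite their solutions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Review:
    reviewer: str
    target: str
    suggested_answer: int
    confidence: float  # assumed to lie in [0, 1]


@dataclass
class Agent:
    name: str
    solve: Callable[[str], int]  # stub standing in for an LLM call (assumption)
    confidence: float            # fixed self-reported confidence (assumption)

    def review(self, target: str, question: str) -> Review:
        # Simplification: the reviewer proposes its own answer as feedback.
        return Review(self.name, target, self.solve(question), self.confidence)


def revise(own_answer: int, own_confidence: float, reviews: List[Review]) -> int:
    # Stand-in for LLM-based revision: confidence-weighted vote over the
    # agent's own answer and the answers suggested by its reviewers.
    weights: Dict[int, float] = {own_answer: own_confidence}
    for r in reviews:
        weights[r.suggested_answer] = weights.get(r.suggested_answer, 0.0) + r.confidence
    return max(weights, key=lambda answer: weights[answer])


def peer_review_round(agents: List[Agent], question: str) -> Dict[str, int]:
    # Stage 1: every agent solves the problem independently.
    initial = {a.name: a.solve(question) for a in agents}

    # Stage 2: every agent reviews the solution of every other agent.
    inbox: Dict[str, List[Review]] = {a.name: [] for a in agents}
    for reviewer in agents:
        for target in agents:
            if reviewer.name != target.name:
                inbox[target.name].append(reviewer.review(target.name, question))

    # Stage 3: every agent revises its initial solution given its reviews.
    by_name = {a.name: a for a in agents}
    return {
        name: revise(initial[name], by_name[name].confidence, inbox[name])
        for name in initial
    }


agents = [
    Agent("A", lambda q: 12, confidence=0.9),
    Agent("B", lambda q: 12, confidence=0.8),
    Agent("C", lambda q: 15, confidence=0.3),  # a low-confidence, incorrect agent
]
final = peer_review_round(agents, "What is 3 * 4?")
print(final)
```

In this toy run, agent C starts from a different answer than A and B and adopts theirs after the round, because the two reviews it receives carry more combined confidence than its own answer. Replacing the stubs with real model calls and free-text reviews would be needed to approximate the actual procedure.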