Modern large language models (LLMs) such as ChatGPT perform remarkably well on general language tasks but still struggle with complex reasoning, which has motivated research into the cognitive behaviors of LLMs and into human-like problem-solving strategies. One representative strategy in this direction is self-reflection, which asks an LLM to iteratively refine its solution using feedback that it generates itself. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solution, it is unable to generate novel thoughts through later reflection, even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in a "tit for tat" state and a judge manages the debate process to obtain a final solution. The MAD framework encourages divergent thinking in LLMs, which is helpful for tasks that require deep levels of contemplation. Experimental results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that an adaptive break of the debate and a modest level of the "tit for tat" state are required for MAD to perform well. Moreover, we find that an LLM might not be a fair judge when different LLMs are used as agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.
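As a rough illustration of the debate loop described above, the following minimal Python sketch has two agents argue in turn while a judge either stops the debate early (the adaptive break) or forces a decision in the final round. It is not the released implementation: the names `run_debate`, `Agent`, `Judge`, and `max_rounds`, the toy question, and the stub agents that stand in for LLM calls are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types for this sketch (not taken from the released code).
# An agent maps the question and the transcript so far to its next argument.
Agent = Callable[[str, List[str]], str]
# A judge returns a final answer, or None to let the debate continue.
# When `final` is True the judge is expected to commit to an answer.
Judge = Callable[[str, List[str], bool], Optional[str]]


def run_debate(question: str, affirmative: Agent, negative: Agent,
               judge: Judge, max_rounds: int = 3) -> Tuple[str, int]:
    """Run a two-agent debate and return (answer, rounds_used)."""
    transcript: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        # Each side sees the full transcript, so it can rebut the other side.
        transcript.append("affirmative: " + affirmative(question, transcript))
        transcript.append("negative: " + negative(question, transcript))
        # Adaptive break: the judge may end the debate before max_rounds.
        verdict = judge(question, transcript, round_idx == max_rounds)
        if verdict is not None:
            return verdict, round_idx
    raise ValueError("judge did not return an answer in the final round")


# Toy stubs standing in for LLM calls (assumptions for illustration only).
def toy_affirmative(question: str, transcript: List[str]) -> str:
    # Starts with the intuitive but wrong answer, then concedes once rebutted.
    return "2" if not transcript else "1.5"


def toy_negative(question: str, transcript: List[str]) -> str:
    # Always argues for the harmonic-mean answer.
    return "1.5"


def toy_judge(question: str, transcript: List[str], final: bool) -> Optional[str]:
    # Compare the answers given in the latest round.
    a, b = (line.split(": ", 1)[1] for line in transcript[-2:])
    if a == b:
        return a
    # Arbitrary tie-break for this sketch: side with the negative agent.
    return b if final else None


question = ("A walker goes uphill at 1 m/s and back down the same path at "
            "3 m/s. What is the average speed in m/s?")
answer, rounds = run_debate(question, toy_affirmative, toy_negative, toy_judge)
# With these stubs the two sides agree in round 2, so the debate stops early.
```

In this sketch the judge is the only component that decides when to stop, which keeps the adaptive break separate from the agents' arguments; how the judge and the agents are actually prompted is left open here.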