Large Language Models (LLMs) have demonstrated human-like intelligence and are widely used in various applications. However, LLMs still exhibit various kinds of inconsistency problems. Existing work mainly focuses on inconsistency within a single LLM, whereas we investigate inter-consistency among multiple LLMs, which is critical when they collaborate to solve a complex task. To examine whether LLMs can collaborate to ultimately reach a consensus on a shared goal, and whether they easily change their viewpoints, we introduce a Formal Debate framework (FORD). With FORD, we conduct a three-stage debate aligned with real-world scenarios: fair debate, mismatched debate, and roundtable debate. Extensive experiments on the commonsense reasoning task show that, through debate, LLMs not only become more inter-consistent but also achieve higher performance. Moreover, we observe that stronger LLMs tend to dominate the debates by adhering to their perspectives, while weaker ones are more likely to change their viewpoints. Additionally, we highlight the importance of a competent judge, such as GPT-4, for drawing more appropriate conclusions. Our work contributes to understanding inter-consistency among LLMs and lays the foundation for the development of future collaboration methods.