Large Language Models (LLMs) have shown impressive capabilities in various applications, but they still suffer from a range of inconsistency issues. Existing work primarily addresses inconsistency within a single LLM; we complement it by exploring the inter-consistency among multiple LLMs in collaboration. To examine whether LLMs can collaborate effectively to reach a consensus on a shared goal, we focus on commonsense reasoning and introduce a formal debate framework (FORD) that conducts a three-stage debate among LLMs, with stages aligned to real-world scenarios: fair debate, mismatched debate, and roundtable debate. Extensive experiments on various datasets show that LLMs can effectively collaborate to reach a consensus despite noticeable inter-inconsistencies, but that imbalances in their abilities can lead to domination by the superior LLMs. Leveraging a more advanced LLM such as GPT-4 as an authoritative judge can boost collaboration performance. Our work contributes to understanding the inter-consistency among LLMs and lays the foundation for developing future collaboration methods. Code and data are available at https://github.com/Waste-Wood/FORD