Large language models (LLMs) have demonstrated impressive zero-shot and few-shot commonsense reasoning performance on various natural language processing (NLP) tasks. However, despite their strong commonsense reasoning abilities, LLMs still exhibit various kinds of inconsistency problems. While previous research has mainly focused on self-consistency within a single LLM, we propose to explore the inter-consistency issue between two or more LLMs, which is critical for diverse and precise decision-making processes. Since LLMs exhibit human-like intelligence after instruction tuning and reinforcement learning from human feedback (RLHF), we design a formal debate framework to delve into the inter-consistency problem among LLMs with a three-stage debate: fair debate, mismatched debate, and roundtable debate. Extensive experiments on 7 commonsense reasoning datasets show that LLMs not only become more inter-consistent through compromise and refutation but also achieve higher performance and stronger interpretability. Furthermore, we find that a much stronger LLM tends to dominate mismatched debates, yet it is easily misled by relatively weaker LLMs in a more complex debate scenario such as roundtable debate.