This paper investigates the rational thinking capability of Large Language Models (LLMs) in multi-round argumentative debates by examining how fallacious arguments affect their logical reasoning performance. Specifically, we present the Logic Competence Measurement Benchmark (LOGICOM), a diagnostic benchmark for assessing the robustness of LLMs against logical fallacies. LOGICOM involves two agents, a persuader and a debater, who engage in a multi-round debate on a controversial topic in which the persuader tries to convince the debater that its claim is correct. First, LOGICOM assesses whether LLMs can change their opinions through reasoning. Then, it evaluates the debater's logical reasoning by contrasting a scenario in which the persuader employs logical fallacies with one in which it uses logical reasoning. We use this benchmark to evaluate GPT-3.5 and GPT-4 on a dataset of controversial topics, claims, and supporting reasons. Our findings indicate that both GPT-3.5 and GPT-4 can adjust their opinions through reasoning. However, when presented with logical fallacies, GPT-3.5 and GPT-4 are erroneously convinced 41% and 69% more often, respectively, than when logical reasoning is used. Finally, we introduce a new dataset containing over 5k pairs of logical vs. fallacious arguments. The source code and dataset of this work are publicly available.
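To make the "more often" comparison concrete, the following is a minimal sketch of how such a figure could be computed from debate outcomes, assuming it is a relative increase in the rate at which the debater ends up convinced. The `DebateOutcome` record, the function names, and the toy counts are all hypothetical illustrations; they are not LOGICOM's actual data format, code, or results.

```python
from dataclasses import dataclass


@dataclass
class DebateOutcome:
    """One finished debate (hypothetical record, not LOGICOM's schema)."""
    persuader_mode: str  # "logical" or "fallacious"
    convinced: bool      # True if the debater ended up agreeing with the claim


def convinced_rate(outcomes, mode):
    """Fraction of debates in the given persuader mode that ended in agreement."""
    selected = [o for o in outcomes if o.persuader_mode == mode]
    if not selected:
        raise ValueError(f"no debates recorded for mode {mode!r}")
    return sum(o.convinced for o in selected) / len(selected)


def relative_increase(outcomes):
    """Relative increase of the convinced rate under fallacious persuasion.

    Assumption: "X% more often" is read as
    (fallacious_rate - logical_rate) / logical_rate.
    """
    logical = convinced_rate(outcomes, "logical")
    fallacious = convinced_rate(outcomes, "fallacious")
    if logical == 0:
        raise ValueError("logical convinced rate is zero; ratio is undefined")
    return (fallacious - logical) / logical


# Toy data, made up for illustration only: 8 debates per persuader mode,
# 2 convinced under logical reasoning and 4 under fallacious arguments.
toy_outcomes = (
    [DebateOutcome("logical", i < 2) for i in range(8)]
    + [DebateOutcome("fallacious", i < 4) for i in range(8)]
)

print(f"relative increase: {relative_increase(toy_outcomes):.0%}")
```

On the toy data the convinced rate doubles from 2/8 to 4/8, so the sketch reports a 100% relative increase. Under the same reading, the reported 41% and 69% would mean the fallacious persuader succeeds 1.41 and 1.69 times as often as the logical one.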