Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Logical reasoning plays a fundamental role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP), with impressive results across various classic NLP tasks. However, whether LLMs can effectively handle logical reasoning, which requires step-by-step cognitive inference similar to human intelligence, remains an open question. In this paper, we aim to bridge this gap with a comprehensive evaluation. First, to make the evaluation systematic, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. For breadth, we include three representative LLMs (i.e., text-davinci-003, ChatGPT and BARD) and evaluate them on all selected datasets under zero-shot, one-shot and three-shot settings. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-grained evaluations from both objective and subjective perspectives, covering both answers and explanations. In addition, to uncover the logical flaws of LLMs, we attribute problematic cases to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and focus purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. It contains 3,000 samples and covers deductive, inductive and abductive settings. Based on these in-depth evaluations, we finally form a general evaluation scheme for logical reasoning capability along six dimensions. It reflects the strengths and weaknesses of LLMs and suggests directions for future work.
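
As a rough illustration of the evaluation setup described above, the following Python sketch enumerates the combinations of model, reasoning setting and shot count, builds a k-shot prompt, and computes accuracy. The model names, reasoning settings and shot counts come from the abstract; everything else is an assumption made for illustration. In particular, the prompt template in `build_prompt`, the exact-match rule in `accuracy`, and the function names themselves are hypothetical and are not the authors' implementation. The grid is also simplified to reasoning settings, whereas the paper evaluates on fifteen individual datasets that the abstract does not name.

```python
from itertools import product

# Taken from the abstract: three LLMs, four reasoning settings, three shot counts.
MODELS = ["text-davinci-003", "ChatGPT", "BARD"]
SETTINGS = ["deductive", "inductive", "abductive", "mixed-form"]
SHOTS = [0, 1, 3]  # zero-shot, one-shot, three-shot


def build_prompt(demos, question, k):
    """Prepend the first k solved demonstrations to the question.

    The "Q:/A:" template is a hypothetical format, not the one used in the paper.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in demos[:k]]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers (assumed metric)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def evaluation_grid():
    """Every (model, reasoning setting, shot count) combination to evaluate."""
    return list(product(MODELS, SETTINGS, SHOTS))
```

Accuracy alone is the kind of simple metric the abstract argues is insufficient; a fuller harness would also record each model's explanation so that it can be judged objectively and subjectively and assigned to an error type.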
