Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation From Deductive, Inductive and Abductive Views

Large Language Models (LLMs) have achieved great success in various natural language tasks. This has aroused much interest in evaluating specific reasoning capabilities of LLMs, such as multilingual reasoning and mathematical reasoning. However, logical reasoning, one of the key reasoning perspectives, has not yet been thoroughly evaluated. In this work, we aim to bridge this gap and provide a comprehensive evaluation. First, to make the evaluation systematic, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. For comprehensiveness, we include three representative LLMs (i.e., text-davinci-003, ChatGPT and BARD) and evaluate them on all selected datasets under zero-shot, one-shot and three-shot settings. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-grained evaluations from both objective and subjective perspectives, covering both answers and explanations. In addition, to uncover the logical flaws of LLMs, we attribute bad cases to five error types along two dimensions. Third, to avoid the influence of knowledge bias and focus purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. It contains 3K samples and covers deductive, inductive and abductive reasoning settings. Based on these in-depth evaluations, we finally summarize the logical reasoning capability of LLMs in ability maps along six dimensions (i.e., correct, rigorous, self-aware, active, oriented and no hallucination). These maps reflect the strengths and weaknesses of LLMs and suggest directions for future work.
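To make the size of the evaluation design concrete, the following minimal sketch enumerates the combinations of models, datasets and shot settings described above. It is an illustration under stated assumptions, not the authors' evaluation code: the dataset names are placeholders (the abstract does not list the fifteen datasets), and the `build_prompt` helper and its Q/A prompt format are hypothetical.

```python
from itertools import product

# Models and shot settings are the ones named in the abstract.
MODELS = ["text-davinci-003", "ChatGPT", "BARD"]
SHOTS = {"zero-shot": 0, "one-shot": 1, "three-shot": 3}
# Placeholder names: the abstract does not list the fifteen datasets.
DATASETS = [f"dataset_{i:02d}" for i in range(1, 16)]


def build_prompt(demos, question, k):
    """Prepend the first k solved demonstrations to the test question.

    The "Q:/A:" layout is a hypothetical format chosen for illustration.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in demos[:k]]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def evaluation_grid():
    """Enumerate every (model, dataset, setting) combination."""
    return list(product(MODELS, DATASETS, SHOTS))


grid = evaluation_grid()
print(len(grid))  # 3 models * 15 datasets * 3 settings = 135
```

Under these assumptions, a full pass covers 135 model, dataset and setting combinations, each of which would then be scored on both answers and explanations.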
