Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Logical reasoning plays a fundamental role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP), achieving impressive results on various classic NLP tasks. However, whether LLMs can effectively handle logical reasoning, which requires step-by-step cognitive inference similar to human intelligence, remains an open question. To bridge this gap, we provide a comprehensive evaluation in this paper. First, to make the evaluation systematic, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. For breadth, we include three representative LLMs (i.e., text-davinci-003, ChatGPT and BARD) and evaluate them on all selected datasets under zero-shot, one-shot and three-shot settings. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-grained evaluations from both objective and subjective perspectives, covering both answers and explanations. In addition, to uncover the logical flaws of LLMs, we attribute problematic cases to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and focus purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. It contains 3,000 samples and covers deductive, inductive and abductive settings. Based on these in-depth evaluations, we form a general evaluation scheme for logical reasoning capability along six dimensions. It reflects the strengths and weaknesses of LLMs and suggests directions for future work.
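As a rough illustration of the evaluation setup described above, the following Python sketch enumerates the combinations of model, shot setting and reasoning setting, and shows one way a k-shot prompt might be assembled. The model names, shot counts and reasoning settings come from the abstract; the prompt template, the function names `build_prompt` and `evaluation_grid`, and the demonstration format are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product

# Taken from the abstract.
MODELS = ["text-davinci-003", "ChatGPT", "BARD"]
SHOT_SETTINGS = {"zero-shot": 0, "one-shot": 1, "three-shot": 3}
REASONING_SETTINGS = ["deductive", "inductive", "abductive", "mixed-form"]


def build_prompt(demos, question, k):
    """Assemble a k-shot prompt (hypothetical template, not the paper's).

    demos: list of (question, answer) pairs used as demonstrations.
    question: the test question to be answered.
    k: number of demonstrations to prepend (0, 1 or 3 in the abstract).
    """
    if k > len(demos):
        raise ValueError("not enough demonstrations for the requested k")
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in demos[:k]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def evaluation_grid():
    """List every (model, shot setting, reasoning setting) combination."""
    return list(product(MODELS, SHOT_SETTINGS, REASONING_SETTINGS))


if __name__ == "__main__":
    demos = [("All birds fly. Tweety is a bird. Does Tweety fly?", "Yes")]
    print(build_prompt(demos, "All cats purr. Tom is a cat. Does Tom purr?", 1))
    print(len(evaluation_grid()), "combinations")
```

The grid here is indexed by reasoning setting rather than by individual dataset, since the abstract does not name the fifteen datasets.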
