Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But can they really "reason" over natural language? This question has received significant research attention, and many reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied. However, the crucial skill of logical reasoning remains underexplored. Existing work investigating this ability in LLMs has focused on only a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. To address this limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset in which each instance focuses on the use of a single inference rule. We conduct a detailed analysis of a range of LLMs, including GPT-4, ChatGPT, Gemini, Llama-2, and Mistral, using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; in particular, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information that is necessary to arrive at the correct conclusion. We believe that our work and findings will facilitate future research on evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at https://github.com/Mihir3009/LogicBench.
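
For readers unfamiliar with the two inference rules named above, the following minimal sketch checks their validity by brute-force truth table. It is an illustration only and is not taken from the LogicBench code or data; the helper names `is_valid` and `implies` are hypothetical. An argument is treated as valid when the conclusion holds under every truth assignment that satisfies all premises.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: 'a -> b' is false only when a is true and b is false."""
    return (not a) or b


def is_valid(premises, conclusion, n_vars: int = 2) -> bool:
    """Return True if the conclusion holds in every assignment satisfying all premises.

    Hypothetical helper for illustration; not part of LogicBench.
    """
    for values in product([False, True], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # found a counterexample
    return True


# Modus ponens: from (p -> q) and p, infer q.
modus_ponens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: p],
    lambda p, q: q,
)

# Modus tollens: from (p -> q) and not q, infer not p.
modus_tollens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: not q],
    lambda p, q: not p,
)

# A common fallacy for contrast: from (p -> q) and q, infer p.
affirming_consequent = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: q],
    lambda p, q: p,
)

print(modus_ponens, modus_tollens, affirming_consequent)
```

The first two checks succeed and the third fails, since p false and q true satisfies both premises of the fallacy while falsifying its conclusion. This brute-force check covers only propositional rules; first-order and non-monotonic patterns would need a different treatment.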
