The reasoning abilities of large language models (LLMs) are the topic of a growing body of research in AI and cognitive science. In this paper, we probe the extent to which twenty-five LLMs are able to distinguish logically correct inferences from logically fallacious ones. We focus on inference patterns involving conditionals (e.g., 'If Ann has a queen, then Bob has a jack') and epistemic modals (e.g., 'Ann might have an ace', 'Bob must have a king'). These inferences have been of special interest to logicians, philosophers, and linguists, since they play a central role in the fundamental human ability to reason about distal possibilities. Assessing LLMs on these inferences is thus highly relevant to the question of how much the reasoning abilities of LLMs match those of humans. Among the LLMs we tested, all but the GPT-4 model family often make basic mistakes with conditionals, though zero-shot chain-of-thought prompting helps them make fewer mistakes. Moreover, even the GPT-4 family displays logically inconsistent judgments across inference patterns involving epistemic modals, and almost all models give answers to certain complex conditional inferences widely discussed in the literature that do not match human judgments. These results highlight gaps in basic logical reasoning in today's LLMs.
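To make the distinction between logically correct and logically fallacious inferences concrete, the following minimal sketch checks four textbook conditional patterns by brute-force truth tables. It assumes the material-conditional reading of "if ... then", which is a simplification and not necessarily the semantics adopted in this paper, and it does not cover epistemic modals such as "might" and "must". The names `implies`, `is_valid`, and `PATTERNS` are illustrative only.

```python
from itertools import product


def implies(p, q):
    """Material conditional: false only when p is true and q is false."""
    return (not p) or q


def is_valid(premises, conclusion, n_vars=2):
    """An inference is valid if no assignment makes every premise true
    and the conclusion false."""
    for values in product([True, False], repeat=n_vars):
        if all(prem(*values) for prem in premises) and not conclusion(*values):
            return False  # found a counterexample
    return True


# a = "Ann has a queen", b = "Bob has a jack"
conditional = lambda a, b: implies(a, b)

PATTERNS = {
    "modus ponens": ([conditional, lambda a, b: a], lambda a, b: b),
    "modus tollens": ([conditional, lambda a, b: not b], lambda a, b: not a),
    "affirming the consequent": ([conditional, lambda a, b: b], lambda a, b: a),
    "denying the antecedent": ([conditional, lambda a, b: not a], lambda a, b: not b),
}

results = {name: is_valid(prems, concl) for name, (prems, concl) in PATTERNS.items()}

for name, ok in results.items():
    print(f"{name}: {'valid' in str(ok).lower() or ('valid' if ok else 'fallacious')}")
```

Under this reading, modus ponens and modus tollens come out valid, while affirming the consequent and denying the antecedent are fallacious, since the assignment where Ann has no queen but Bob has a jack makes their premises true and their conclusions false.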