The reasoning abilities of large language models (LLMs) are the topic of a growing body of research in AI and cognitive science. In this paper, we probe the extent to which twenty-nine LLMs are able to distinguish logically correct inferences from logically fallacious ones. We focus on inference patterns involving conditionals (e.g., 'If Ann has a queen, then Bob has a jack') and epistemic modals (e.g., 'Ann might have an ace', 'Bob must have a king'). These inferences have been of special interest to logicians, philosophers, and linguists, since they play a central role in the fundamental human ability to reason about distal possibilities. Assessing LLMs on these inferences is thus highly relevant to the question of how much the reasoning abilities of LLMs match those of humans. All the LLMs we tested make some basic mistakes with conditionals or modals, though zero-shot chain-of-thought prompting helps them make fewer mistakes. Even the best performing LLMs make basic errors in modal reasoning, display logically inconsistent judgments across inference patterns involving epistemic modals and conditionals, and give answers about complex conditional inferences that do not match reported human judgments. These results highlight gaps in basic logical reasoning in today's LLMs.