The reasoning abilities of large language models (LLMs) are the topic of a growing body of research in artificial intelligence and cognitive science. In this paper, we probe the extent to which a dozen LLMs are able to distinguish logically correct inferences from logically fallacious ones. We focus on inference patterns involving conditionals (e.g., 'If Ann has a queen, then Bob has a jack') and epistemic modals (e.g., 'Ann might have an ace', 'Bob must have a king'). These inference patterns have been of special interest to logicians, philosophers, and linguists, since they plausibly play a central role in human reasoning. Assessing LLMs on these inference patterns is thus highly relevant to the question of how closely the reasoning abilities of LLMs match those of humans. Among the LLMs we tested, all but GPT-4 often make basic mistakes with conditionals. Moreover, even GPT-4 displays logically inconsistent judgments across inference patterns involving epistemic modals.