Abstract reasoning is a key ability for an intelligent system. Large language models (LMs) achieve above-chance performance on abstract reasoning tasks but exhibit many imperfections. However, human abstract reasoning is also imperfect. For example, human reasoning is affected by our real-world knowledge and beliefs, and shows notable "content effects": humans reason more reliably when the semantic content of a problem supports the correct logical inferences. These content-entangled reasoning patterns play a central role in debates about the fundamental nature of human intelligence. Here, we investigate whether language models, whose prior expectations capture some aspects of human knowledge, similarly mix content into their answers to logical problems. We explore this question across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task. We evaluate state-of-the-art large language models, as well as humans, and find that the language models reflect many of the same patterns observed in humans across these tasks: like humans, models answer more accurately when the semantic content of a task supports the logical inferences. These parallels are reflected both in answer patterns and in lower-level features such as the relationship between model answer distributions and human response times. Our findings have implications for understanding both these cognitive effects in humans and the factors that contribute to language model performance.