In the present study, we investigate and compare reasoning in large language models (LLMs) and humans using a selection of cognitive psychology tools traditionally dedicated to the study of (bounded) rationality. To do so, we presented new variants of classical cognitive experiments to human participants and an array of pretrained LLMs, and compared their performance. Our results showed that most of the included models made reasoning errors akin to those frequently ascribed to error-prone, heuristic-based human reasoning. Notwithstanding this superficial similarity, an in-depth comparison between humans and LLMs revealed important differences from human-like reasoning, with the models' limitations disappearing almost entirely in more recent LLM releases. Moreover, we show that while it is possible to devise strategies to induce better performance, humans and machines are not equally responsive to the same prompting schemes. We conclude by discussing the epistemological implications and challenges of comparing human and machine behavior for both artificial intelligence and cognitive psychology.