This paper investigates the problem-solving capabilities of Large Language Models (LLMs) by evaluating their performance on stumpers: unique single-step intuition problems that are challenging for human solvers but whose solutions are easy to verify. We compare the performance of four state-of-the-art LLMs (Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) with that of human participants. Our findings reveal that the new-generation LLMs excel at solving stumpers and surpass human performance. However, humans are better at verifying solutions to the same problems. This research deepens our understanding of LLMs' cognitive abilities and offers insights for improving their problem-solving potential across various domains.