The cognitive mechanism by which Large Language Models (LLMs) solve mathematical problems remains a widely debated, unresolved issue. Currently, little interpretable experimental evidence connects LLMs' problem-solving behavior with human cognitive psychology. To determine whether LLMs possess human-like mathematical reasoning, we modified the problems used in the human Cognitive Reflection Test (CRT). Our results show that, even with Chain-of-Thought (CoT) prompting, mainstream LLMs, including the latest o1 model (noted for its reasoning capability), exhibit high error rates on these modified CRT problems; average accuracy dropped by up to 50% relative to the original questions. Further analysis of the LLMs' incorrect answers suggests that they rely primarily on pattern matching over their training data, behavior that aligns with human intuition (System 1 thinking) rather than with deliberate human-like reasoning (System 2 thinking). This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans'. As a result, this work may temper overly optimistic views of LLMs' progress toward artificial general intelligence.