The cognitive mechanism by which Large Language Models (LLMs) solve mathematical problems remains a widely debated, unresolved issue. Currently, little interpretable experimental evidence connects LLMs' problem-solving behavior with human cognitive psychology. To determine whether LLMs possess human-like mathematical reasoning, we modified the problems used in the human Cognitive Reflection Test (CRT). Our results show that, even with Chain-of-Thought (CoT) prompting, mainstream LLMs, including the latest o1 model (noted for its reasoning capability), exhibit high error rates on these modified CRT problems; average accuracy dropped by up to 50% relative to the original questions. Further analysis of the LLMs' incorrect answers suggests that they rely primarily on pattern matching over their training data, behavior that aligns with human intuition (System 1 thinking) rather than with deliberate human-like reasoning (System 2 thinking). This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans'. As a result, this work may temper overly optimistic views of LLMs' progress toward artificial general intelligence.