Despite their proficiency in math tasks, the mechanisms underlying LLMs' mathematical reasoning abilities remain a subject of debate. Recent studies suggest that chain-of-thought (CoT) prompts can bolster mathematical reasoning by encouraging LLMs to employ human-like logical reasoning (System 2), enabling them to excel on the Cognitive Reflection Test (CRT). To assess whether LLMs genuinely possess System 2-like logical reasoning, we introduced targeted modifications to CRT problems. Our findings reveal that, despite the use of CoT prompts, mainstream LLMs, including the latest o1-preview model, continue to exhibit a significant error rate. Further analysis indicates that they predominantly rely on System 1-like intuitive reasoning and pattern matching derived from training data, rather than demonstrating mastery of mathematical thinking. This discovery challenges the prevailing notion that LLMs possess genuine logical reasoning abilities and that CoT can enhance them. Consequently, this work may temper overly optimistic projections regarding LLMs' advancement toward artificial general intelligence.