Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or genuine reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in their proximity to the training corpus, ranging from common states solvable by memorization to novel ones requiring first-principles reasoning. We systematically evaluate multiple GPT generations under varying reasoning intensities. Our analysis reveals a clear gradient: performance degrades consistently as demands on fluid intelligence increase. Notably, on out-of-distribution tasks, performance collapses to random levels. Although newer models improve overall, progress slows markedly on tasks outside the training distribution. Furthermore, while reasoning-augmented inference improves performance, its marginal benefit per token diminishes as distributional proximity increases. These results suggest that current architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.