Large language models (LLMs) achieve strong performance on challenging reasoning benchmarks, yet they can also make elementary reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks stems from memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical-reasoning benchmark based on Knights and Knaves (K&K) puzzles. We find that LLMs can interpolate the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models rely heavily on memorization to solve the training puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. In-depth analyses with perturbation tests, cross-difficulty-level transferability, probing of model internals, and fine-tuning with wrong answers suggest that LLMs learn to reason on K&K puzzles despite memorizing the training data. This phenomenon indicates that LLMs exhibit a complex interplay between memorization and genuine reasoning abilities. Finally, our analysis with a per-sample memorization score sheds light on how LLMs switch between reasoning and memorization when solving logical puzzles. Our code and data are available at https://memkklogic.github.io.