Large language models (LLMs) achieve strong performance on challenging reasoning benchmarks, yet can also make basic reasoning mistakes. This contrast makes it puzzling to understand the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high, nearly saturated performance on common reasoning benchmarks could be due to memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We find that LLMs can interpolate and memorize the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet they struggle with slight variations of these puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. Through in-depth analyses with perturbation tests, cross-difficulty-level transferability, probing of model internals, and fine-tuning with wrong answers, we establish that LLMs develop reasoning skills on K&K puzzles alongside memorization. Finally, our analysis based on a per-sample memorization score sheds light on how LLMs switch between reasoning and memorization when solving logical puzzles. Our code and data are available at https://memkklogic.github.io.
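To make the benchmark concrete: in a Knights and Knaves puzzle, knights always tell the truth and knaves always lie, and the task is to deduce each inhabitant's identity from their statements. The following is a minimal brute-force solver sketch (a hypothetical helper for illustration, not the paper's actual puzzle generator), where each statement is encoded as a predicate over a candidate assignment:

```python
from itertools import product

def solve_kk(names, statements):
    """Return all knight/knave assignments consistent with every statement.

    `names` lists the inhabitants; `statements` is a list of
    (speaker, claim) pairs, where `claim` is a predicate over an
    assignment dict (name -> True for knight, False for knave).
    A speaker's claim must be true iff the speaker is a knight.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(assignment[who] == claim(assignment)
               for who, claim in statements):
            solutions.append(assignment)
    return solutions

# Example 2-person puzzle: A says "B is a knave";
# B says "A and I are of the same kind."
statements = [
    ("A", lambda a: not a["B"]),
    ("B", lambda a: a["A"] == a["B"]),
]
print(solve_kk(["A", "B"], statements))
# → [{'A': True, 'B': False}]  (A is a knight, B is a knave)
```

Because puzzles like this are generated programmatically from logical templates, fresh instances and slight perturbations of a given puzzle can be produced on demand, which is what enables the memorization-versus-reasoning measurements described above.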