We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The level of generalization also varies across reasoning types: when faced with out-of-distribution examples, transformers fail to generalize systematically for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanisms behind grokking, such as the formation of the generalizing circuit and the relative efficiency of the generalizing versus memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings offer guidance on data and training setups that better induce implicit reasoning, and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that on a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro, which rely on non-parametric memory, fail badly regardless of prompting style or retrieval augmentation, while a fully grokked transformer achieves near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.