Despite their impressive performance on a variety of complex tasks, modern large language models (LLMs) still struggle with some math problems that are simple and intuitive for humans, such as addition. While humans can easily learn the basic rules of addition and apply them to new problems of any length, LLMs struggle to do the same; instead, they may rely on similar cases seen in the training corpus. We refer to these two different reasoning mechanisms as "rule-based reasoning" and "case-based reasoning". Since rule-based reasoning is essential for acquiring systematic generalization ability, we aim to determine whether transformers use rule-based or case-based reasoning for math problems. Through carefully designed intervention experiments on five math tasks, we confirm that transformers perform case-based reasoning, with or without a scratchpad, which aligns with previous observations that transformers use subgraph matching/shortcut learning to reason. To mitigate this problem, we propose a Rule-Following Fine-Tuning (RFFT) technique that teaches transformers to perform rule-based reasoning. Specifically, we provide explicit rules in the input and then instruct transformers to recite and follow the rules step by step. Through RFFT, we successfully enable LLMs fine-tuned on 1-5 digit addition to generalize to up to 12-digit addition with over 95% accuracy, more than 40% higher than scratchpad. This significant improvement demonstrates that explicitly teaching LLMs to use rules helps them learn rule-based reasoning and generalize better in length.
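To make the recite-and-follow idea concrete, here is a minimal sketch, assuming a plausible format for RFFT training examples on addition (the exact prompt format used in the paper may differ; all names here are illustrative): the explicit rule is stated in the input, and the target output recites and applies it digit by digit with an explicit carry.

```python
# Hypothetical sketch of a Rule-Following Fine-Tuning (RFFT) training
# example for addition: the rule appears verbatim in the input, and the
# target trace recites and applies it one digit at a time.
# Function and variable names are illustrative, not from the paper.

RULE = (
    "Rule: add the two numbers digit by digit from right to left, "
    "keeping a carry; at each step, digit = (a + b + carry) % 10 and "
    "carry = (a + b + carry) // 10."
)

def rule_following_trace(x: int, y: int) -> str:
    """Build a step-by-step trace that recites and follows the rule."""
    a, b = str(x)[::-1], str(y)[::-1]  # reversed digit strings, ones place first
    steps, digits, carry = [], [], 0
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        s = da + db + carry
        digits.append(str(s % 10))
        steps.append(
            f"Step {i + 1}: {da} + {db} + carry {carry} = {s}; "
            f"write {s % 10}, carry {s // 10}."
        )
        carry = s // 10
    if carry:  # a leftover carry becomes the leading digit
        digits.append(str(carry))
        steps.append(f"Final carry {carry} becomes the leading digit.")
    answer = "".join(reversed(digits))
    return "\n".join([RULE, f"Compute {x} + {y}."] + steps + [f"Answer: {answer}"])
```

Because every step is derived from the stated rule rather than from memorized examples, a trace generated this way has the same form for 2-digit and 12-digit operands, which is the property that plausibly supports length generalization.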