Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood. Arithmetic tasks provide a controlled setting for investigating these behaviors. Previous studies have left two seemingly unrelated mysteries: (1) models with appropriate positional embeddings can correctly perform longer, unseen arithmetic operations such as addition, but their effectiveness varies on more complex tasks such as multiplication; (2) models generalize well to longer unseen cases of modular addition under certain moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101), regardless of the positional encoding used. We argue that previous studies have been treating the symptoms rather than the root cause: they have paid excessive attention to improving model components while overlooking the differences in task properties that may be the real drivers. We confirm this with a unified theoretical framework covering these arithmetic scenarios. For example, unlike multiplication, digit addition is translation invariant, a property that naturally aligns with relative positional encoding, and this combination enables addition to generalize to unseen, longer operands. The discrepancy between operations modulo 100 and modulo 101 arises from the base: modulo 100, unlike modulo 101, is compatible with the decimal system (base 10), so the result depends only on the units and tens digits, and no unseen information from higher digits is needed. Extensive experiments with GPT-like models validate our theoretical predictions. These findings deepen our understanding of generalization mechanisms and facilitate more data-efficient model training and objective-oriented AI alignment.
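To make the translation-invariance claim concrete, the sketch below (ours, not taken from the paper) traces column-wise addition and checks that shifting both operands by one digit merely shifts the per-position (digit, carry) trace. The local rule is identical at every position, which is exactly the structure a relative positional encoding can reuse at positions never seen in training.

```python
# A minimal sketch (not from the paper): digit addition applies the same
# local rule "s_i = a_i + b_i + carry_in" at every position, so shifting
# both operands by one digit (multiplying by 10) shifts the whole trace.

def digit_add_trace(a: int, b: int, width: int) -> list[tuple[int, int]]:
    """(output digit, carry out) at each position, least significant first."""
    trace, carry = [], 0
    for i in range(width):
        s = (a // 10**i) % 10 + (b // 10**i) % 10 + carry
        carry = s // 10
        trace.append((s % 10, carry))
    return trace

t = digit_add_trace(678, 456, 4)          # [(4, 1), (3, 1), (1, 1), (1, 0)]
t_shift = digit_add_trace(6780, 4560, 5)  # [(0, 0), (4, 1), (3, 1), (1, 1), (1, 0)]

# Translation invariance: the shifted trace is the original trace, moved one slot.
assert t_shift[1:] == t
```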
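The base-compatibility argument can likewise be checked directly: since 10^2 ≡ 0 (mod 100), a sum modulo 100 depends only on the last two digits of each operand, whereas 10^2 ≡ -1 (mod 101), so higher digits cannot be discarded. A small sketch of this arithmetic fact (ours, not from the paper):

```python
# A minimal sketch (not from the paper): modulo 100 is compatible with
# base 10 (10**2 == 0 mod 100), so truncating each operand to its last two
# digits preserves the answer; modulo 101 (10**2 == -1 mod 101) does not.

a, b = 12345, 678
a2, b2 = a % 100, b % 100  # keep only the units and tens digits

assert (a + b) % 100 == (a2 + b2) % 100  # holds for every a, b
assert (a + b) % 101 != (a2 + b2) % 101  # fails in general: higher digits matter
```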