Language Models Need Inductive Biases to Count Inductively

Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or through the cognitive science literature on how children learn to count. In both cases, the argument holds that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, length generalization is studied throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or to the maximum stack size induced by a pushdown automaton. In general problem solving, length refers to the number of hops in a deductive reasoning chain or to the recursion depth. In all cases, counting is central to task success, and, crucially, generalizing counting inductively is central to success on out-of-distribution (OOD) instances. This work provides extensive empirical results on training language models to count. We experiment with architectures including RNNs, Transformers, State-Space Models, and RWKV. We present carefully designed task formats, auxiliary tasks, and positional embeddings to avoid generalization limitations caused by OOD positions and OOD vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how the design choices that enable parallelized training of modern RNNs cause them to lose the merits of a recurrent nature.
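
As a rough illustration of what counting inductively means, the sketch below applies the same successor-style update at every position, so the count is correct at any sequence length, including lengths far beyond those seen earlier. This is an illustrative toy, not the task format or any model from this work; the function name `recurrent_count` and the token vocabulary are hypothetical.

```python
def recurrent_count(tokens, target="a"):
    """Count occurrences of `target` with a one-unit recurrence h_t = h_{t-1} + x_t.

    Hypothetical toy example: the hidden state h is the running count.
    """
    h = 0  # base case: nothing counted yet
    for tok in tokens:
        x = 1 if tok == target else 0  # input indicator for the counted token
        h = h + x  # same update at every step, independent of position
    return h


short_seq = ["a", "b", "a"]      # a short, "in-distribution" length
long_seq = ["a", "b"] * 1000     # a much longer sequence
print(recurrent_count(short_seq))  # 2
print(recurrent_count(long_seq))   # 1000
```

Because the update rule never refers to the absolute position, nothing in it needs to change when the sequence grows, which is the sense in which a recurrent state can carry a count to unseen lengths.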
