Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
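The gap between Fourier-domain sparsity and mod-$T$ separability can be illustrated with a toy construction. The sketch below is not the paper's proof or experimental setup. It assumes synthetic features built from $\cos$ and $\sin$ of $2\pi n/T$, and it uses a least-squares one-vs-rest probe as a stand-in for linear classification. The helper names `dominant_period` and `linear_probe_accuracy` are hypothetical.

```python
import numpy as np

T = 10
n = np.arange(200)             # length is a multiple of T, so period T falls on an exact DFT bin
theta = 2 * np.pi * n / T
labels = n % T

def dominant_period(feature):
    """Period of the largest non-DC component of the feature's DFT."""
    spectrum = np.abs(np.fft.rfft(feature - feature.mean()))
    k = int(np.argmax(spectrum[1:])) + 1   # skip the DC bin
    return len(feature) / k

def linear_probe_accuracy(features, labels, num_classes):
    """Fit a least-squares one-vs-rest linear probe (with bias) and
    return its training accuracy. This is an assumed stand-in for a
    linear classifier, chosen because it needs only NumPy."""
    X = np.column_stack([np.ones(len(features)), features])
    Y = np.eye(num_classes)[labels]          # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.mean(np.argmax(X @ W, axis=1) == labels))

# Toy feature A: a single cosine. Its spectrum spikes at period T, but
# residues r and T - r receive the same value, so they cannot be told apart.
cos_only = np.cos(theta)[:, None]

# Toy feature B: cosine and sine together. The T residues sit at distinct
# points on a circle, so each one can be cut off by a line.
cos_sin = np.column_stack([np.cos(theta), np.sin(theta)])

period_a = dominant_period(cos_only[:, 0])
acc_a = linear_probe_accuracy(cos_only, labels, T)
acc_b = linear_probe_accuracy(cos_sin, labels, T)

print("dominant period of feature A:", period_a)
print("probe accuracy, cosine only:", acc_a)
print("probe accuracy, cosine + sine:", acc_b)
```

Both toy features have a period-$T$ spike, yet only the two-dimensional one supports a perfect linear probe here. This matches the "not sufficient" direction of the claim; it does not demonstrate that sparsity is necessary.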