We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs are substantially harder than linear congruential generators (LCGs) because they apply a series of bit-wise shifts, XORs, rotations, and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, on tasks beyond the reach of published classical attacks. In our experiments we scale moduli up to $2^{22}$, using models with up to $50$ million parameters and datasets with up to $5$ billion tokens. Surprisingly, we find that even when the output is truncated to a single bit, the model can still predict it reliably. When multiple distinct PRNGs are presented together during training, the model learns them jointly, identifying the structure underlying the different permutations. We demonstrate a scaling law in the modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, making curriculum learning essential. Finally, we analyze the embedding layers and uncover a novel clustering phenomenon: the top principal components spontaneously group the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
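To make the permutation structure concrete, here is a minimal Python sketch of a PCG-style generator in the spirit of the reference PCG32 (XSH-RR) design: an LCG advances a 64-bit hidden state, and the output is produced by an xorshift, a truncation to 32 bits, and a state-dependent rotation. The constants below are the standard PCG32 ones and the seeding is simplified; they are illustrative and not necessarily the exact variants used in our experiments.

```python
# Standard PCG32 multiplier and a fixed odd increment (illustrative choices).
MULT = 6364136223846793005
INC = 1442695040888963407
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def pcg32_step(state: int) -> tuple[int, int]:
    """Advance the hidden state by an LCG and emit one permuted output."""
    # LCG transition on the 64-bit hidden state.
    new_state = (state * MULT + INC) & MASK64
    # Output permutation (XSH-RR): xorshift-high, truncate to 32 bits...
    xorshifted = (((state >> 18) ^ state) >> 27) & MASK32
    # ...then rotate right by the top 5 bits of the old state.
    rot = state >> 59
    out = ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & MASK32
    return new_state, out

def pcg32_sequence(seed: int, n: int) -> list[int]:
    """Generate n outputs from a (simplified) seeded PCG32-style stream."""
    state = (seed + INC) & MASK64  # simplified seeding, not the reference one
    outs = []
    for _ in range(n):
        state, out = pcg32_step(state)
        outs.append(out)
    return outs
```

The bit-level operations in the output function are what destroy the linear structure an attacker (or a model) could otherwise exploit in a plain LCG; truncating `out` further, down to a single bit in the extreme case, is the setting studied above.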