We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs are substantially harder than linear congruential generators (LCGs) because they apply a series of bit-wise shifts, XORs, rotations, and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, on tasks beyond the reach of published classical attacks. In our experiments we scale moduli up to $2^{22}$, using models with up to $50$ million parameters and datasets with up to $5$ billion tokens. Surprisingly, we find that even when the output is truncated to a single bit, the model can still predict it reliably. When multiple distinct PRNGs are presented together during training, the model learns them jointly, identifying the structure underlying the different permutations. We demonstrate a scaling law in the modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, making curriculum learning essential. Finally, we analyze the embedding layers and uncover a novel clustering phenomenon: the top principal components spontaneously group the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
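To make the permutation structure concrete, here is a minimal Python sketch of a PCG-style generator in the spirit of the reference PCG32 (XSH-RR) design: an LCG advances a 64-bit hidden state, and the output is produced by an xorshift, a truncation to 32 bits, and a state-dependent rotation. The constants below are the standard PCG32 ones and the seeding is simplified; they are illustrative and not necessarily the exact variants used in our experiments.

```python
# Standard PCG32 multiplier and a fixed odd increment (illustrative choices).
MULT = 6364136223846793005
INC = 1442695040888963407
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def pcg32_step(state: int) -> tuple[int, int]:
    """Advance the hidden state by an LCG and emit one permuted output."""
    # LCG transition on the 64-bit hidden state.
    new_state = (state * MULT + INC) & MASK64
    # Output permutation (XSH-RR): xorshift-high, truncate to 32 bits...
    xorshifted = (((state >> 18) ^ state) >> 27) & MASK32
    # ...then rotate right by the top 5 bits of the old state.
    rot = state >> 59
    out = ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & MASK32
    return new_state, out

def pcg32_sequence(seed: int, n: int) -> list[int]:
    """Generate n outputs from a (simplified) seeded PCG32-style stream."""
    state = (seed + INC) & MASK64  # simplified seeding, not the reference one
    outs = []
    for _ in range(n):
        state, out = pcg32_step(state)
        outs.append(out)
    return outs
```

The bit-level operations in the output function are what destroy the linear structure an attacker (or a model) could otherwise exploit in a plain LCG; truncating `out` further, down to a single bit in the extreme case, is the setting studied above.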