Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules, and they often learn solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that predict its next latent state given the next token. Theoretically, we show that these latents provably converge towards belief states, that is, compressed information about the history necessary to predict the future. This simple auxiliary objective injects a recurrent inductive bias into transformers while leaving their architecture, parallel training efficiency, and inference unchanged. NextLat thereby encourages transformers to form compact internal world models with coherent belief states and transition dynamics, crucial properties that standard next-token prediction alone does not guarantee. Empirically, across benchmarks in world modeling, reasoning, planning, and language modeling, NextLat demonstrates significant gains over standard next-token prediction and other baselines in downstream accuracy, representation compression, and lookahead planning. Furthermore, NextLat enables variable-length self-speculative decoding, accelerating inference by up to 3.3x in language modeling. NextLat offers a simple yet effective paradigm for learning compact, predictive representations in transformers that generalize better. Our code is available at https://github.com/JaydenTeoh/NextLat.
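To make the auxiliary objective described in the abstract concrete, the following is a minimal sketch of how a next-token loss might be combined with a next-latent prediction loss. It is an illustration under stated assumptions, not the authors' implementation: the linear dynamics model (`W`, `E`), the squared-error latent loss, the weighting factor `lam`, and all function names are hypothetical choices that the abstract does not specify. The code uses plain Python lists, so no gradients are computed; a real training setup would use an autodiff framework and a learned dynamics network.

```python
import math


def next_token_loss(logits, targets):
    """Mean cross-entropy of per-position logits against target token ids."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def predict_next_latent(h, token, W, E):
    """Hypothetical latent dynamics: a linear map applied to [h ; embed(token)]."""
    x = h + E[token]  # list concatenation of latent and token embedding
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def next_latent_loss(hidden, tokens, W, E):
    """Mean squared error between predicted and actual next latents.

    hidden[t] is assumed to be the latent state after reading tokens[0..t].
    """
    total, count = 0.0, 0
    for t in range(len(hidden) - 1):
        pred = predict_next_latent(hidden[t], tokens[t + 1], W, E)
        target = hidden[t + 1]
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
        count += len(target)
    return total / count


def combined_loss(logits, hidden, tokens, W, E, lam=1.0):
    """Next-token loss plus a weighted next-latent loss (lam is an assumption)."""
    # logits[t] is assumed to score tokens[t + 1], so the last position is unused.
    ce = next_token_loss(logits[:-1], tokens[1:])
    return ce + lam * next_latent_loss(hidden, tokens, W, E)
```

In this sketch the latent term is zero exactly when every latent state can be computed from the previous latent state and the incoming token alone, which is the recurrent-style consistency the abstract refers to. The next-token term is unchanged from standard training.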