We initiate an investigation into the optimization properties of next-token prediction (NTP), the dominant training paradigm for modern language models. Specifically, we study the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across distinct contexts, each associated with a sparse conditional probability distribution over a finite vocabulary of tokens, we introduce "NTP-separability conditions" that enable reaching the data-entropy lower bound. With this setup, and focusing on linear models with fixed context embeddings, we characterize the optimization bias of gradient descent (GD): Within the data subspace defined by the sparsity patterns of distinct contexts, GD selects parameters that equate the logit differences of in-support tokens with their log-odds. In the orthogonal subspace, the GD parameters diverge in norm and select the direction that maximizes a margin specific to NTP. These findings extend previous research on implicit bias in one-hot classification to the NTP setting, highlighting key differences and prompting further research into the optimization and generalization properties of NTP, irrespective of the specific architecture used to generate the context embeddings.
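The bias described above can be illustrated with a minimal numerical sketch. The construction below is our own toy instance, not the paper's exact setup: a single context with a fixed unit-norm embedding, a linear decoder trained by GD on the soft-label cross-entropy, and a sparse target distribution over three tokens (one token out of support). After training, the logit gap between the two in-support tokens matches their log-odds, while the logit of the out-of-support token diverges downward.

```python
import numpy as np

# Toy instance (our own construction, illustrative only): one context with a
# fixed embedding h, a linear decoder W, and a sparse next-token distribution
# p over a vocabulary of 3 tokens; token 2 has probability 0 (out of support).
rng = np.random.default_rng(0)
V, d = 3, 3                      # vocabulary size, embedding dimension
h = rng.normal(size=d)
h /= np.linalg.norm(h)           # fixed, unit-norm context embedding
p = np.array([0.75, 0.25, 0.0])  # sparse target next-token distribution

W = np.zeros((V, d))             # linear model: logits z = W @ h
lr = 0.5
for _ in range(20000):
    z = W @ h
    s = np.exp(z - z.max())
    s /= s.sum()                  # softmax probabilities
    W -= lr * np.outer(s - p, h)  # GD step on the cross-entropy loss

z = W @ h
# In-support logit gap approaches the log-odds log(p0/p1) = log 3:
print(z[0] - z[1], np.log(p[0] / p[1]))
# The out-of-support logit keeps diverging (margin direction):
print(z[0] - z[2])
```

Because the gradient of the cross-entropy with respect to the logits is `softmax(z) - p`, the in-support gap `z[0] - z[1]` is driven toward `log(p[0]/p[1])`, while `z[2]` must go to negative infinity for the loss to reach the data-entropy lower bound, which is exactly the norm divergence the abstract refers to.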