Can a mere next-token predictor faithfully model human intelligence? We crystallize this intuitive concern, which is fragmented in the literature. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem that we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism by which teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. We provide preliminary evidence that this failure can be resolved by training to predict multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available at https://github.com/gregorbachmann/Next-Token-Failures
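To make the distinction between the two phases concrete, the following is a minimal Python sketch and is not taken from the released code. The names `toy_model`, `teacher_forced_predictions`, and `autoregressive_rollout` are hypothetical, and the "model" is a hand-written rule standing in for a learned predictor. The sketch only illustrates how the same predictor is queried differently in each phase, and how a single mistake can compound at inference time; it does not reproduce the teacher-forcing learning failure discussed above.

```python
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]


def toy_model(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for a learned next-token predictor.

    It counts upward by one, except for a single deliberate mistake:
    after the token 3 it skips ahead to 5.
    """
    last = prefix[-1]
    return last + 2 if last == 3 else last + 1


def teacher_forced_predictions(model: Model, sequence: List[Token]) -> List[Token]:
    """Predict each token from the ground-truth prefix that precedes it."""
    return [model(sequence[:i]) for i in range(1, len(sequence))]


def autoregressive_rollout(model: Model, prompt: List[Token], num_steps: int) -> List[Token]:
    """Predict each token from the prompt plus the model's own earlier outputs."""
    generated = list(prompt)
    for _ in range(num_steps):
        generated.append(model(generated))
    return generated[len(prompt):]


target = [1, 2, 3, 4, 5, 6]
answers = target[1:]

# Teacher forcing: the mistake after 3 stays local, because every later
# prediction is conditioned on the correct prefix.
tf = teacher_forced_predictions(toy_model, target)
tf_correct = sum(p == t for p, t in zip(tf, answers))

# Autoregressive inference: the same mistake shifts every later prediction,
# because the model conditions on its own erroneous output.
ar = autoregressive_rollout(toy_model, target[:1], len(answers))
ar_correct = sum(p == t for p, t in zip(ar, answers))

print("teacher-forced:", tf, f"({tf_correct}/{len(answers)} correct)")
print("autoregressive:", ar, f"({ar_correct}/{len(answers)} correct)")
```

In this toy setting the teacher-forced pass gets 4 of 5 tokens right while the autoregressive rollout gets only 2 of 5, even though both use the same predictor. This is the inference-time compounding that the popular criticism targets; the concern raised above is different, namely that teacher-forced training may not yield an accurate predictor at all.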