While autoregressive (AR) Transformer-based generative language models are frequently employed for lookahead tasks, recent research suggests they may struggle with planning tasks that require multi-step lookahead. In this work, we investigate the distinct mechanisms that emerge when AR versus non-autoregressive (NAR) models, such as discrete diffusion models (dLLMs), are trained on lookahead tasks. By requiring the models to plan ahead to reach the correct conclusion, we analyze how the two paradigms differ fundamentally in their approach to the problem. We identify a critical asymmetry in planning problems: while forward generation requires complex lookahead at branching junctions, reverse generation is often deterministic. This asymmetry creates an opportunity for NAR models. Through mechanistic analysis of training and inference dynamics, we demonstrate that NAR models learn to solve planning tasks by using future tokens to decode backwards, sidestepping the need to learn complex traversal mechanisms entirely. Both AR and NAR models can ultimately achieve perfect accuracy on the lookahead task, but NAR models require exponentially fewer training examples and shallower architectures than AR models, which often fail to converge without specific curriculum adjustments.
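The forward/backward asymmetry can be made concrete with a toy construction. The sketch below is a hypothetical "path-star"-style planning task (an illustrative stand-in, not the paper's exact benchmark): a root fans out into several disjoint chains, and the task is to emit the path from the root to a given target leaf. Generating forward from the root requires search over branches (the lookahead an AR model must learn), whereas starting from the target leaf, every node has exactly one parent, so decoding backwards is deterministic; all function and parameter names here are invented for illustration.

```python
def build_path_star(num_arms: int, arm_len: int):
    """Build a star of disjoint chains rooted at node 0.

    Returns child and parent maps plus the leaf node of each arm.
    """
    children = {0: []}   # node 0 is the root
    parent = {}
    nxt = 1
    leaves = []
    for _ in range(num_arms):
        prev = 0
        for _ in range(arm_len):
            node, nxt = nxt, nxt + 1
            children[prev].append(node)
            children[node] = []
            parent[node] = prev
            prev = node
        leaves.append(prev)
    return children, parent, leaves

def forward_path(children, target, node=0):
    """AR-style forward generation: the correct arm at the root can only be
    found by looking ahead (here, brute-force search over branches)."""
    if node == target:
        return [node]
    for child in children[node]:
        sub = forward_path(children, target, child)
        if sub is not None:
            return [node] + sub
    return None

def backward_path(parent, target):
    """NAR-style use of the future token: walk deterministically from the
    target leaf back to the root via unique parent pointers."""
    path = [target]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

children, parent, leaves = build_path_star(num_arms=3, arm_len=4)
target = leaves[1]
# Both directions recover the same path, but the backward pass never branches.
assert forward_path(children, target) == backward_path(parent, target)
```

Note how `forward_path` must recurse into (and abandon) wrong arms, while `backward_path` performs no search at all; this is the asymmetry that lets NAR models avoid learning a traversal mechanism.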