The paradigm of Large Language Models (LLMs) is currently defined by auto-regressive (AR) architectures, which generate text through a sequential ``brick-by-brick'' process. Despite their success, AR models are inherently constrained by a causal bottleneck that limits global structural foresight and iterative refinement. Diffusion Language Models (DLMs) offer a transformative alternative, conceptualizing text generation as a holistic, bidirectional denoising process, akin to a sculptor refining a masterpiece. However, the potential of DLMs remains largely untapped, as they are frequently confined within AR-legacy infrastructures and optimization frameworks. In this Perspective, we identify ten fundamental challenges, ranging from architectural inertia and gradient sparsity to the limitations of linear reasoning, that prevent DLMs from reaching their ``GPT-4 moment''. We propose a strategic roadmap organized into four pillars: foundational infrastructure, algorithmic optimization, cognitive reasoning, and unified multimodal intelligence. By shifting toward a diffusion-native ecosystem characterized by multi-scale tokenization, active remasking, and latent thinking, we can move beyond the constraints of the causal horizon. We argue that this transition is essential for developing next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration.
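The denoising-with-remasking generation process contrasted with AR decoding above can be illustrated with a minimal toy sketch. Everything here is assumed for illustration: `toy_predict` stands in for a bidirectional denoiser, and the confidence-based unmasking schedule is one simple heuristic, not the specific method advocated in this Perspective.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_predict(tokens):
    """Hypothetical stand-in for a bidirectional denoiser: proposes a
    token and a confidence score for every currently masked position,
    conditioning (in a real model) on the whole sequence at once."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=5, steps=5):
    # Start from a fully masked canvas, unlike AR's left-to-right growth.
    tokens = [MASK] * length
    for step in range(steps):
        proposals = toy_predict(tokens)
        if not proposals:
            break
        # Commit a growing fraction of the most confident positions;
        # the rest remain masked ("remasking") for later refinement.
        keep = max(1, round(len(proposals) * (step + 1) / steps))
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:keep]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

print(diffusion_decode())
```

Because every step sees the entire sequence, later iterations can revise low-confidence regions in light of global context, which is the structural foresight the abstract argues AR decoding lacks.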