In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process under a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking scheme that preserves local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by $3\times$ compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
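The dynamic parallel decoding described above can be sketched as follows. This is an illustrative toy, not the paper's exact rule: `dummy_model` and the prefix-acceptance policy are assumptions standing in for CARD's denoiser and its confidence criterion, and a real implementation would reuse the KV-cache for the committed prefix.

```python
import random

random.seed(0)

def dummy_model(prefix, block_size):
    """Stand-in for CARD's denoiser (hypothetical): returns (token, confidence)
    proposals for the next `block_size` positions given the committed prefix.
    A real model would reuse its KV-cache for `prefix`."""
    return [(random.randrange(1000), random.random()) for _ in range(block_size)]

def dynamic_parallel_decode(prompt, max_len=32, block_size=8, threshold=0.5):
    """Confidence-gated parallel decoding: propose `block_size` tokens at once,
    commit the longest prefix whose per-token confidence stays above
    `threshold`, then repeat from the new prefix."""
    seq = list(prompt)
    while len(seq) < max_len:
        proposals = dummy_model(seq, block_size)
        committed = 0
        for tok, conf in proposals:
            if conf < threshold:
                break  # stop committing at the first low-confidence token
            seq.append(tok)
            committed += 1
        if committed == 0:
            # Guarantee progress: fall back to accepting a single token,
            # as a purely autoregressive step would.
            seq.append(proposals[0][0])
    return seq[:max_len]

out = dynamic_parallel_decode([1, 2, 3])
```

Under this policy, generation degrades gracefully to one-token-per-step autoregression when the model is uncertain, and commits multi-token blocks when it is confident, which is where the latency benefit comes from.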