Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse than autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and a block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder outperforms a wide range of ~8B AR models and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Finally, diffusion-based any-order modeling improves structured code modeling for editing and reasoning and, through data augmentation, benefits low-resource programming languages.
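To make the block-wise clipped noise schedule concrete, the following is a minimal sketch, not the released implementation: each block draws its own corruption level, which is clipped to a bounded range so no block is left nearly clean or fully masked. The names `block_size`, `t_min`, and `t_max` are illustrative assumptions; the paper's exact schedule may differ.

```python
# Hypothetical sketch of a block-wise clipped noise schedule for block diffusion
# training. block_size, t_min, and t_max are illustrative assumptions.
import torch


def block_clipped_mask(input_ids: torch.Tensor,
                       mask_token_id: int,
                       block_size: int = 32,
                       t_min: float = 0.2,
                       t_max: float = 0.8):
    """Corrupt a token sequence block by block.

    For each block, a noise level t is drawn uniformly and clipped to
    [t_min, t_max], one plausible way to avoid blocks that are almost
    clean or almost fully masked and thus stabilize training.
    Returns the corrupted ids and the boolean mask of corrupted positions.
    """
    batch, seq_len = input_ids.shape
    num_blocks = (seq_len + block_size - 1) // block_size

    # One clipped noise level per (batch, block).
    t = torch.rand(batch, num_blocks).clamp_(t_min, t_max)

    # Broadcast block-level noise to token level and trim to seq_len.
    t_tok = t.repeat_interleave(block_size, dim=1)[:, :seq_len]

    # Mask each token independently with its block's noise level.
    noise_mask = torch.rand(batch, seq_len) < t_tok
    noisy_ids = input_ids.masked_fill(noise_mask, mask_token_id)
    return noisy_ids, noise_mask


if __name__ == "__main__":
    ids = torch.randint(1, 1000, (2, 64))
    noisy, mask = block_clipped_mask(ids, mask_token_id=0)
    print(mask.float().mean().item())  # average corruption rate, within [t_min, t_max] in expectation
```

Under this kind of schedule, the per-block clipping bounds the gradient signal each block contributes, which is one way such a design could support the stable CPT behavior described above.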