Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse than autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and a block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder outperforms its AR counterpart overall on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B AR models and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Finally, diffusion-based any-order modeling improves structured code modeling for editing and reasoning and, through data augmentation, benefits low-resource programming languages.
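To make the block-wise clipped noise schedule concrete, here is a minimal sketch of the general idea, assuming a masking-style diffusion process: each block samples its own masking rate, clipped away from the extremes so every block sees a non-trivial denoising task. The function name, parameter values, and clipping bounds below are illustrative assumptions, not the paper's actual configuration.

```python
import random

def block_clipped_mask(tokens, block_size=4, t_min=0.3, t_max=0.8,
                       mask_id=-1, seed=0):
    """Sketch of block-wise clipped noise (assumed form): each block draws
    its own masking rate, clipped to [t_min, t_max], then masks its tokens
    independently at that rate."""
    rng = random.Random(seed)
    noised = list(tokens)
    masked = [False] * len(tokens)
    for start in range(0, len(tokens), block_size):
        # Clipped schedule: the rate never reaches 0 (no signal to learn)
        # or 1 (block fully destroyed), which is the assumed stabilizer.
        rate = rng.uniform(t_min, t_max)
        for i in range(start, min(start + block_size, len(tokens))):
            if rng.random() < rate:
                noised[i] = mask_id
                masked[i] = True
    return noised, masked

# Example: corrupt a 16-token sequence in blocks of 4.
toks = list(range(16))
noised, masked = block_clipped_mask(toks)
```

A model trained on such block-local corruption can then denoise one block at a time while conditioning on clean context, which is the block diffusion setting the abstract describes.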