LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, Zhuocheng Gong, Yanmei Gu, Jian Guan, Kaiyuan Guan, Hongliang He, Zenan Huang, Juyong Jiang, Zhonghui Jiang, Zhenzhong Lan, Chengxi Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Yuan Lu, Yuxin Ma, Xingyu Mou, Zhenxuan Pan, Kaida Qiu, Yuji Ren, Jianfeng Tan, Yiding Tian, Zian Wang, Lanning Wei, Tao Wu, Yipeng Xing, Wentao Ye, Liangyu Zha, Tianze Zhang, Xiaolu Zhang, Junbo Zhao, Da Zheng, Hao Zhong, Wanli Zhong, Jun Zhou, Junlin Zhou, Liwang Zhu, Muzhi Zhu, Yihong Zhuang

from arXiv, 11 pages, 3 figures

While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the balance between decoding speed and generation quality has remained hard to strike. Today, we unveil LLaDA2.1, which is designed to move past this trade-off. By weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This structural change gives rise to two distinct modes: the Speedy Mode (S Mode), which aggressively lowers the M2T threshold to bypass traditional constraints while relying on T2T to refine the output; and the Quality Mode (Q Mode), which uses conservative thresholds to secure superior benchmark performance with a manageable loss of efficiency. Building on this, and supported by an expanded context window, we implement the first large-scale Reinforcement Learning (RL) framework specifically tailored for diffusion large language models (dLLMs), anchored by specialized techniques for stable gradient estimation. This alignment not only sharpens reasoning precision but also improves instruction-following fidelity, narrowing the gap between diffusion dynamics and complex human intent. We conclude by releasing LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and very fast decoding. Despite its 100B parameter scale, on coding tasks it attains 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.
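
To make the joint threshold-decoding idea concrete, the following is a minimal sketch of a single decoding step over one block, not the authors' implementation. It assumes that a masked position is committed when the model's top-token confidence reaches an M2T threshold, and that an already-decoded token is overwritten when the model prefers a different token with confidence at or above a T2T threshold. The function name `threshold_decode_step`, the `MASK` marker, and the threshold values in `S_MODE` and `Q_MODE` are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

MASK = None  # hypothetical marker for a position that is still masked


def threshold_decode_step(
    tokens: List[Optional[int]],
    probs: Sequence[Sequence[float]],
    m2t_threshold: float,
    t2t_threshold: float,
) -> Tuple[List[Optional[int]], int, int]:
    """Apply one joint M2T/T2T update to a block (illustrative rule only).

    tokens: current block, with MASK at undecoded positions.
    probs:  one probability distribution over the vocabulary per position,
            standing in for the model's output at this step.
    Returns the updated block and the counts of unmasked and edited positions.
    """
    updated = list(tokens)
    unmasked = 0
    edited = 0
    for i, dist in enumerate(probs):
        best = max(range(len(dist)), key=dist.__getitem__)
        conf = dist[best]
        if tokens[i] is MASK:
            # M2T: commit a masked position once confidence is high enough.
            if conf >= m2t_threshold:
                updated[i] = best
                unmasked += 1
        elif best != tokens[i] and conf >= t2t_threshold:
            # T2T: overwrite a decoded token the model now disagrees with.
            updated[i] = best
            edited += 1
    return updated, unmasked, edited


# Assumed, illustrative settings: S Mode lowers only the M2T threshold.
S_MODE = {"m2t_threshold": 0.5, "t2t_threshold": 0.9}
Q_MODE = {"m2t_threshold": 0.9, "t2t_threshold": 0.9}

block = [2, MASK, MASK]
probs = [
    [0.02, 0.95, 0.03],  # model now prefers token 1 over the committed token 2
    [0.60, 0.30, 0.10],  # moderately confident in token 0
    [0.40, 0.35, 0.25],  # no confident choice yet
]

s_result = threshold_decode_step(block, probs, **S_MODE)
q_result = threshold_decode_step(block, probs, **Q_MODE)
print(s_result)  # ([1, 0, None], 1, 1)
print(q_result)  # ([1, None, None], 0, 1)
```

In this toy block, the lower M2T threshold lets the S Mode settings commit one more position in the same step than the Q Mode settings, while both apply the same T2T edit to the first position. That is the intended intuition: faster unmasking per step, with editing available to correct tokens that were committed too early.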
