DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Diffusion-based decoding has recently emerged as an appealing alternative to autoregressive (AR) generation, offering the potential to update multiple tokens in parallel and reduce latency. However, diffusion vision language models (dVLMs) still lag significantly behind mainstream AR vision language models, because base diffusion language models (dLLMs) are scarcer and weaker than their AR counterparts. This raises a natural question: can we build high-performing dVLMs directly from existing powerful AR models, without relying on dLLMs? We propose DiffusionVL, a family of dVLMs obtained by translating pretrained AR models into the diffusion paradigm via an efficient diffusion finetuning procedure that changes the training objective and decoding process while keeping the backbone architecture intact. This approach yields two key observations: (1) converting AR-based multimodal models to the diffusion paradigm is remarkably effective; (2) directly converting an AR language model into a dVLM is also feasible, achieving performance comparable to that of the same AR model finetuned with standard autoregressive visual instruction tuning. To enable practical open-ended generation, we further integrate block decoding, which supports arbitrary-length outputs and KV-cache reuse for faster inference. Our experiments demonstrate that, despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
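
The abstract does not spell out the block decoding procedure, so the following is only a minimal sketch of how block-wise masked diffusion decoding is commonly organized, not the DiffusionVL implementation. It assumes a confidence-based unmasking schedule, and every name in it (`MASK`, `EOS`, `decode_block`, `block_decode`, `toy_denoiser`) is hypothetical. A toy denoiser stands in for the model so the control flow can be run end to end.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical mask token id (assumption, not from the paper)
EOS = 0    # hypothetical end-of-sequence token id (assumption)

Prediction = Tuple[int, float]  # (token id, confidence)
Denoiser = Callable[[List[int], List[int]], List[Prediction]]


def decode_block(prefix: List[int], block_size: int, steps: int,
                 denoise: Denoiser) -> List[int]:
    """Fill one block of MASK tokens over at most `steps` parallel steps."""
    block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceil division: tokens committed per step
    while MASK in block:
        preds = denoise(prefix, block)  # one prediction per block position
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]
    return block


def block_decode(prompt: List[int], block_size: int, steps: int,
                 max_blocks: int, denoise: Denoiser) -> List[int]:
    """Generate block by block until EOS appears or max_blocks is reached."""
    seq = list(prompt)
    for _ in range(max_blocks):
        # Finished blocks are never revised, so a real model could reuse
        # the KV cache computed for `seq` at this point.
        block = decode_block(seq, block_size, steps, denoise)
        if EOS in block:
            seq.extend(block[: block.index(EOS) + 1])
            break
        seq.extend(block)
    return seq


def toy_denoiser(prefix: List[int], block: List[int]) -> List[Prediction]:
    """Stand-in for a model: counts upward and emits EOS from position 10."""
    start = len(prefix)
    preds = []
    for i in range(len(block)):
        pos = start + i
        token = EOS if pos >= 10 else pos + 1
        confidence = 1.0 / (1 + i)  # earlier positions are more confident
        preds.append((token, confidence))
    return preds


result = block_decode([1, 2, 3], block_size=4, steps=2, max_blocks=8,
                      denoise=toy_denoiser)
print(result)
```

Because the output grows one block at a time and stops at `EOS`, its length is not fixed in advance, and each block needs only `steps` denoiser calls rather than one call per token. Those are the two properties the abstract attributes to block decoding; the speedup in a real system would depend on the model and schedule actually used.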
