We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, an MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and visual tokens within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Beyond parallel decoding, inference efficiency is further improved through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
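To make the phrase "block-level masked diffusion" concrete, the following is a minimal, illustrative sketch of block-wise masked decoding with parallel unmasking. It is not the LLaDA2.0-Uni implementation: `toy_predict` is a hypothetical random stand-in for the dLLM backbone, `MASK`, `block_masked_decode`, and all parameters are assumed names, and the confidence-ranked unmasking schedule is one common choice rather than the one confirmed by the abstract. Prefix-aware optimizations such as caching are not modeled.

```python
import random

MASK = -1  # hypothetical sentinel id marking a masked position


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the backbone: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def block_masked_decode(prompt, num_blocks, block_size, steps_per_block,
                        vocab_size=100, seed=0):
    """Generate num_blocks blocks; each block is denoised from all-MASK."""
    rng = random.Random(seed)
    seq = list(prompt)
    # Positions to commit per denoising step (ceiling division).
    per_step = -(-block_size // steps_per_block)
    for _ in range(num_blocks):
        start = len(seq)
        seq.extend([MASK] * block_size)  # the new block starts fully masked
        while any(t == MASK for t in seq[start:]):
            # Earlier tokens (prompt and finished blocks) act as a fixed prefix.
            preds = toy_predict(seq, vocab_size, rng)
            masked = [i for i in range(start, len(seq)) if seq[i] == MASK]
            # Commit the most confident masked positions in parallel.
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:per_step]:
                seq[i] = preds[i][0]
    return seq


if __name__ == "__main__":
    out = block_masked_decode([1, 2, 3], num_blocks=2, block_size=4,
                              steps_per_block=2)
    print(len(out), MASK in out)
```

In this sketch, each block of 4 tokens is filled in 2 steps instead of 4 sequential ones, which is the sense in which masked diffusion decodes in parallel. In a unified model, the same loop would apply whether the token ids denote text or discretized visual tokens.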