UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and generation, which lead to suboptimal modeling of both tasks; (2) separate visual spaces for understanding and generation, which impede scalability; and (3) over-reliance on task-specific data, which neglects the duality between text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With the Noisy ViT encoder, UniDDT uses the latent space as a unified visual representation, making understanding and generation tasks compatible and balancing the scalability of generation with the semantic expressiveness of understanding. We also construct dual data structures from the same image-text pairs, fostering interdependence between generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation, with enhanced semantic consistency and scalability. On visual generation, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. On multimodal understanding, it achieves a score of 1699.5 on the MME benchmark and an overall score of 76.5 on SEEDbench.
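The dual data construction mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the actual data format, so everything here is an assumption for illustration: the `Sample` type, its field names, and the `build_dual_samples` helper are hypothetical. The sketch only shows the general idea that one image-text pair can yield both an understanding sample (image in, text out) and a generation sample (text in, image out).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample; not the paper's actual schema."""
    task: str       # "understanding" or "generation"
    condition: str  # what the model is given
    target: str     # what the model is asked to produce


def build_dual_samples(image_ref: str, caption: str) -> Tuple[Sample, Sample]:
    """Derive two mirrored samples from a single image-text pair.

    Assumption: the image is referred to by a string identifier and the
    text is a plain caption. The two samples swap condition and target.
    """
    understanding = Sample(task="understanding", condition=image_ref, target=caption)
    generation = Sample(task="generation", condition=caption, target=image_ref)
    return understanding, generation


# Example with made-up placeholder data.
und, gen = build_dual_samples("img_0001.png", "a red bicycle leaning against a wall")
```

In this sketch the two samples are exact mirrors of each other, which is one simple way to express the duality between the two tasks; the paper's construction may differ.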
