The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they have traditionally lacked the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose \textbf{CoM-DAD} (\textbf{Co}upled \textbf{M}anifold \textbf{D}iscrete \textbf{A}bsorbing \textbf{D}iffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual process. CoM-DAD decouples high-level semantic planning from low-level token synthesis: first, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, conditioned on these evolving semantic priors and regulated by a \textbf{Variable-Rate Noise Schedule}. Crucially, we introduce a \textbf{Stochastic Mixed-Modal Transport} strategy that aligns disparate modalities without requiring computationally heavy contrastive dual-encoders. Our method demonstrates superior training stability compared with standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
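For intuition, the following is a minimal sketch of the token-level stage under a standard absorbing-state (D3PM-style) formulation of discrete diffusion; the per-position rate $\alpha_t^{(i)}$, the one-hot vectors $\mathbf{e}_k$, and the semantic latent $\mathbf{z}_t$ are illustrative notation of ours, not definitions fixed by this abstract.
% Minimal sketch (illustrative assumptions): an absorbing-state forward kernel in
% which each token x^{(i)} is independently replaced by the [MASK] state with
% probability 1 - \alpha_t^{(i)}; letting the rate \alpha_t^{(i)} vary per position
% is one simple way a variable-rate noise schedule could be realized.
\begin{align*}
  q\bigl(x_t^{(i)} \mid x_0^{(i)}\bigr)
    &= \operatorname{Cat}\!\Bigl(x_t^{(i)};\;
       \alpha_t^{(i)}\,\mathbf{e}_{x_0^{(i)}}
       + \bigl(1-\alpha_t^{(i)}\bigr)\,\mathbf{e}_{[\mathrm{MASK}]}\Bigr),\\
  p_\theta\bigl(x_{t-1} \mid x_t, \mathbf{z}_t\bigr)
    &= \textstyle\prod_i p_\theta\bigl(x_{t-1}^{(i)} \mid x_t, \mathbf{z}_t\bigr),
\end{align*}
where $\mathbf{e}_k$ denotes a one-hot vector over the vocabulary extended with the absorbing $[\mathrm{MASK}]$ state, and $\mathbf{z}_t$ stands for the evolving semantic prior produced by the continuous latent diffusion stage. Since $[\mathrm{MASK}]$ is absorbing, the marginal from $x_0$ takes this closed form, and the reverse model only has to predict the absorbed tokens.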