While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore more effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
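To make the mask-based discrete diffusion objective concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of one training step over a unified sequence of discrete multimodal tokens; the names `model`, `MASK_ID`, and `VOCAB_SIZE` are hypothetical placeholders.

```python
# Illustrative sketch of a mask-prediction (discrete diffusion) training step.
# Assumptions: a single unified vocabulary covers text/speech/image tokens,
# and `model` maps a corrupted token sequence to per-position logits.
import torch
import torch.nn.functional as F

MASK_ID = 0            # hypothetical id reserved for the [MASK] token
VOCAB_SIZE = 65536     # hypothetical size of the unified multimodal vocabulary

def masked_diffusion_step(model, tokens):
    """tokens: LongTensor of shape (batch, seq_len) of discrete multimodal tokens."""
    b, n = tokens.shape
    # Sample a masking ratio t ~ U(0, 1) per sequence (the diffusion "time step").
    t = torch.rand(b, 1, device=tokens.device)
    # Corrupt the sequence: mask each position independently with probability t.
    mask = torch.rand(b, n, device=tokens.device) < t
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    # Predict the original token at every position from the corrupted sequence.
    logits = model(corrupted)                      # (batch, seq_len, VOCAB_SIZE)
    # Denoising objective: cross-entropy on masked positions only.
    loss = F.cross_entropy(logits[mask], tokens[mask])
    return loss
```

Because all modalities share one token space and one mask-prediction objective, the same step applies whether the masked spans fall on text, speech, or image tokens, which is what allows a single model to cover both bimodal and multi-modality scenarios.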