MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation

We introduce MUNI, an end-to-end multimodal latent diffusion framework for any-to-any generation that unifies subset-conditioned cross-modal generation and unconditional joint sampling through a shared stochastic latent. Existing multimodal generative models are largely LLM-based, which limits their ability to leverage modality-specific generators and requires text-paired data for training. Recent diffusion- and flow-based any-to-any extensions take a different direction but still rely on text-aligned embeddings, fully paired training, or matched-dimensionality deterministic mappings. MUNI rests on two complementary contributions, one architectural and one in the training objective. First, we extend latent diffusion to multimodal any-to-any generation end-to-end: instead of the standard two-stage recipe that precomputes a frozen latent space and then fits a prior over it, MUNI jointly trains modality-specific encoders, expressive decoders, and a single shared flow-based prior under one objective. Second, we identify that the standard aggregation rules of multimodal variational inference are insufficient once coupled with a learned prior and expressive decoders. A suitable shared latent must simultaneously satisfy coherence across generated modalities, predictive sufficiency of subset latents, and minimality of the latent content. We propose a routed training objective whose structural choices align the latent with these criteria and admit a minimal-sufficiency characterization in the realizable setting. Experiments on PolyMNIST-Quadrant-Labels and a large-scale image-text-audio benchmark show that MUNI matches or exceeds the strongest baselines on conditional generation, with its largest margins on unconditional coherence. Project page: https://muni-proj.github.io/.
