Multi-modal datasets are ubiquitous in modern applications, and multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities. However, existing approaches suffer from a coherence-quality tradeoff: models with good generation quality lack generative coherence across modalities, and vice versa. We discuss the limitations that underlie the unsatisfactory performance of existing methods to motivate the need for a different approach. We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders. The individual latent variables are concatenated into a common latent space, which is fed to a masked diffusion model to enable generative modeling. We also introduce a new multi-time training method to learn the conditional score network for multi-modal diffusion. Our methodology substantially outperforms competitors in both generation quality and coherence, as shown through an extensive experimental campaign.
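As a rough illustration of the pipeline described in the abstract, the following minimal sketch concatenates per-modality latent vectors and perturbs each modality block with its own diffusion time, so that a block with time zero stays clean and can act as conditioning. The abstract does not specify any of these details: the function names `concat_latents` and `noise_with_mask`, the noise schedule, and the mask convention are all assumptions made for illustration and should not be read as the authors' implementation.

```python
import numpy as np

def concat_latents(latents):
    """Concatenate per-modality latents; return the joint array and each block's slice."""
    slices, start = [], 0
    for z in latents:
        slices.append(slice(start, start + z.shape[-1]))
        start += z.shape[-1]
    return np.concatenate(latents, axis=-1), slices

def noise_with_mask(z, slices, times, rng):
    """Perturb each modality block with its own diffusion time in [0, 1].

    times[i] == 0 leaves modality i untouched (conditioning block).
    times[i] > 0 applies z_t = sqrt(1 - t) * z + sqrt(t) * eps, which is
    a hypothetical variance-preserving schedule, not taken from the paper.
    Returns the perturbed latents and a 0/1 mask marking the noised entries.
    """
    z_t = z.copy()
    mask = np.zeros_like(z)
    for sl, t in zip(slices, times):
        if t > 0:
            eps = rng.standard_normal(z[..., sl].shape)
            z_t[..., sl] = np.sqrt(1.0 - t) * z[..., sl] + np.sqrt(t) * eps
            mask[..., sl] = 1.0
    return z_t, mask

# Toy latents standing in for the outputs of two uni-modal autoencoders.
z_img = np.ones((2, 4))
z_txt = np.ones((2, 3))
z, slices = concat_latents([z_img, z_txt])

rng = np.random.default_rng(0)
# Keep the first modality clean and noise the second one.
z_t, mask = noise_with_mask(z, slices, [0.0, 0.5], rng)
```

In a training loop under these assumptions, a score network would receive `z_t` together with the per-modality times and be penalized only on the entries where `mask` is one; sampling different time vectors per batch is one plausible reading of "multi-time" training.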