Multimodal Variational Autoencoders (VAEs) are a promising class of generative models that admit a tractable posterior in the latent space given multiple modalities. Previous studies have shown that the generative quality of each modality declines as the number of modalities increases. In this work, we explore an alternative approach to enhancing the generative performance of multimodal VAEs: jointly modeling the latent spaces of independently trained unimodal VAEs with a score-based model (SBM). The SBM enforces multimodal coherence by learning the correlations among the latent variables. Our model thus combines the superior generative quality of unimodal VAEs with coherent integration across modalities via the latent score-based model, and additionally achieves the best unconditional coherence among the compared models.
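The core idea above can be sketched in code. The following is a minimal, hypothetical illustration (not the paper's implementation): latents from two pretrained unimodal VAE encoders are concatenated, and a small score network is trained with denoising score matching to learn their joint distribution, which is what lets the SBM capture cross-modal correlation. The network architecture, latent sizes, noise level, and the toy correlated latents standing in for real encoder outputs are all assumptions for illustration.

```python
import torch
import torch.nn as nn

LATENT_DIM = 8  # assumed per-modality latent size (illustrative)

class ScoreNet(nn.Module):
    """Score network over the concatenated multimodal latent [z1, z2]."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.SiLU(),
            nn.Linear(128, 128), nn.SiLU(),
            nn.Linear(128, dim),
        )

    def forward(self, z, sigma):
        # Condition on the noise level by appending sigma as a feature.
        return self.net(torch.cat([z, sigma], dim=-1))

def dsm_loss(score_net, z, sigma):
    """Denoising score matching: for Gaussian perturbation z + eps,
    eps ~ N(0, sigma^2 I), the target score is -eps / sigma^2."""
    noise = torch.randn_like(z) * sigma
    target = -noise / sigma**2
    pred = score_net(z + noise, sigma)
    # sigma^2 weighting keeps the loss scale comparable across noise levels.
    return ((sigma**2) * (pred - target) ** 2).sum(-1).mean()

# Toy correlated latents standing in for the outputs of two frozen
# unimodal VAE encoders (hypothetical stand-in data).
torch.manual_seed(0)
z1 = torch.randn(256, LATENT_DIM)
z2 = z1 + 0.1 * torch.randn(256, LATENT_DIM)  # second modality, correlated
z = torch.cat([z1, z2], dim=-1)

model = ScoreNet(2 * LATENT_DIM)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):
    sigma = torch.full((z.size(0), 1), 0.5)
    loss = dsm_loss(model, z, sigma)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, the learned score can drive annealed Langevin dynamics (or a reverse diffusion sampler) over the joint latent, and each modality's slice of the sample is decoded by its own unimodal VAE decoder.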