Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research. Multi-modal Variational Autoencoders (VAEs) are a popular class of generative models that learn latent representations jointly explaining multiple modalities. Various objective functions for such models have been proposed, often motivated as lower bounds on the multi-modal data log-likelihood or by information-theoretic considerations. To encode latent variables from different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts (MoE) aggregation schemes are routinely used and have been shown to yield different trade-offs, for instance in generative quality or in consistency across modalities. In this work, we consider a variational bound that can tightly approximate the data log-likelihood. We develop more flexible aggregation schemes that generalize PoE and MoE approaches by combining encoded features from different modalities with permutation-invariant neural networks. Our numerical experiments illustrate trade-offs for multi-modal variational bounds and various aggregation schemes. We show that tighter variational bounds and more flexible aggregation models can be beneficial when one wants to approximate the true joint distribution over observed modalities and latent variables in identifiable models.
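For readers unfamiliar with the two baseline aggregation schemes named in the abstract, the following is a minimal sketch of PoE and MoE aggregation for one-dimensional Gaussian experts. It is not the method proposed in this work. The function names, the standard-normal prior expert, and the example encoder outputs are all assumptions made for illustration.

```python
import random


def poe_gaussian(means, variances, prior_var=1.0):
    """Product of 1-D Gaussian experts, including an assumed N(0, prior_var) prior expert.

    The product of Gaussian densities is Gaussian (up to normalization): precisions add,
    and the mean is the precision-weighted average of the expert means.
    """
    precisions = [1.0 / prior_var] + [1.0 / v for v in variances]
    weighted_means = [0.0] + [m / v for m, v in zip(means, variances)]
    total_precision = sum(precisions)
    return sum(weighted_means) / total_precision, 1.0 / total_precision


def moe_gaussian_sample(means, variances, rng):
    """Draw one sample from a uniform mixture of 1-D Gaussian experts."""
    k = rng.randrange(len(means))  # pick one modality's expert uniformly at random
    return rng.gauss(means[k], variances[k] ** 0.5)


# Hypothetical encoder outputs (mean, variance) for two modalities.
means = [1.0, 3.0]
variances = [1.0, 1.0]

poe_mean, poe_var = poe_gaussian(means, variances)
moe_draw = moe_gaussian_sample(means, variances, random.Random(0))
```

Both schemes are invariant to the order in which the modalities are listed, which is the property that permutation-invariant neural aggregators preserve while allowing a more flexible combination of the encoded features.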