Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research. Multi-modal Variational Autoencoders (VAEs) are a popular class of generative models that learn latent representations which jointly explain multiple modalities. Various objective functions for such models have been proposed, often motivated as lower bounds on the multi-modal data log-likelihood or by information-theoretic considerations. To encode latent variables from different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts (MoE) aggregation schemes are routinely used and have been shown to yield different trade-offs, for instance in generative quality or in consistency across modalities. In this work, we consider a variational bound that can tightly lower-bound the data log-likelihood. We develop more flexible aggregation schemes that generalise PoE and MoE approaches by combining encoded features from different modalities using permutation-invariant neural networks. Our numerical experiments illustrate trade-offs between multi-modal variational bounds and aggregation schemes. We show that tighter variational bounds and more flexible aggregation models can be beneficial when one wants to approximate the true joint distribution over observed modalities and latent variables in identifiable models.
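To make the aggregation schemes named above concrete, the following is a minimal sketch and not the paper's implementation. It assumes each modality encoder outputs a diagonal Gaussian (mean and variance per latent dimension), that PoE includes a zero-mean Gaussian prior expert, and that MoE uses uniform mixture weights. The function names `poe`, `moe_moments` and `sum_pool` are hypothetical, and `sum_pool` only illustrates the permutation-invariance property with a fixed `tanh` feature map rather than a learned network.

```python
import math
from typing import List, Sequence, Tuple

# Assumed representation: (mean, variance) lists, one entry per latent dimension.
Gaussian = Tuple[List[float], List[float]]


def poe(experts: Sequence[Gaussian], prior_var: float = 1.0) -> Gaussian:
    """Product of diagonal Gaussian experts, including a N(0, prior_var) prior expert."""
    dim = len(experts[0][0])
    precision = [1.0 / prior_var] * dim   # the prior contributes its precision
    weighted_mean = [0.0] * dim           # the prior mean is zero, so it adds nothing here
    for mean, var in experts:
        for d in range(dim):
            precision[d] += 1.0 / var[d]
            weighted_mean[d] += mean[d] / var[d]
    var_out = [1.0 / p for p in precision]
    mean_out = [m * v for m, v in zip(weighted_mean, var_out)]
    return mean_out, var_out


def moe_moments(experts: Sequence[Gaussian]) -> Gaussian:
    """Mean and variance of a uniform mixture of diagonal Gaussian experts."""
    n, dim = len(experts), len(experts[0][0])
    mean_out = [sum(m[d] for m, _ in experts) / n for d in range(dim)]
    second = [sum(v[d] + m[d] ** 2 for m, v in experts) / n for d in range(dim)]
    var_out = [s - mu ** 2 for s, mu in zip(second, mean_out)]
    return mean_out, var_out


def sum_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Permutation-invariant pooling: sum of elementwise-transformed modality features."""
    dim = len(features[0])
    return [sum(math.tanh(f[d]) for f in features) for d in range(dim)]


if __name__ == "__main__":
    experts = [([1.0, 0.0], [1.0, 1.0]), ([3.0, 0.0], [1.0, 1.0])]
    print("PoE:", poe(experts))
    print("MoE moments:", moe_moments(experts))
    print("Pooled:", sum_pool([[0.5, -0.2], [1.5, 0.3]]))
```

In this sketch PoE sharpens the posterior as experts are added (precisions add up), whereas the MoE moments widen when the experts disagree. A pooled feature vector such as the output of `sum_pool` could then be mapped to posterior parameters by a further network, which is one way a permutation-invariant scheme can cover both behaviours.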