Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods approximate the joint posterior by aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of the two. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha = 0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then use this approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
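To make the three aggregation rules concrete, the following is a minimal one-dimensional sketch, not the HELVAE implementation. It assumes Gaussian unimodal posteriors, takes Hölder pooling with $\alpha = 0.5$ to mean the normalized density proportional to $\big(\sum_i w_i \sqrt{q_i(z)}\big)^2$, and matches moments by numerical integration on a grid rather than in closed form. The PoE here omits a prior expert for simplicity. All function names are hypothetical.

```python
import numpy as np


def gaussian_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def poe_moments(mus, sigmas):
    """Product of Gaussian experts (no prior expert): precisions add."""
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    precisions = 1.0 / sigmas**2
    var = 1.0 / precisions.sum()
    mean = var * (precisions * mus).sum()
    return mean, var


def moe_moments(mus, sigmas, weights):
    """Mean and variance of a weighted mixture of Gaussian experts."""
    mus, sigmas, w = (np.asarray(a, float) for a in (mus, sigmas, weights))
    mean = (w * mus).sum()
    var = (w * (sigmas**2 + mus**2)).sum() - mean**2
    return mean, var


def holder_half_moments(mus, sigmas, weights, grid):
    """Moment-match a Gaussian to the alpha = 0.5 Hoelder pool on a uniform grid.

    Assumed pooled density: proportional to (sum_i w_i * sqrt(q_i(z)))^2.
    """
    root_mix = sum(
        w * np.sqrt(gaussian_pdf(grid, m, s))
        for m, s, w in zip(mus, sigmas, weights)
    )
    pooled = root_mix**2
    dz = grid[1] - grid[0]
    pooled = pooled / (pooled.sum() * dz)  # normalize with a Riemann sum
    mean = (grid * pooled).sum() * dz
    var = ((grid - mean) ** 2 * pooled).sum() * dz
    return mean, var


# Two experts that disagree on the mean but share the same scale.
mus, sigmas, weights = [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]
grid = np.linspace(-12.0, 12.0, 4801)

poe_mean, poe_var = poe_moments(mus, sigmas)
moe_mean, moe_var = moe_moments(mus, sigmas, weights)
hel_mean, hel_var = holder_half_moments(mus, sigmas, weights, grid)
```

In this toy setting the PoE variance is 0.5 and the MoE variance is 2, while the moment-matched $\alpha = 0.5$ pool falls between the two. This illustrates, under the stated assumptions, how such a pool interpolates between the sharp PoE and the diffuse MoE.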