We propose a new multimodal variational autoencoder that can generate from the joint distribution and conditionally on any number of complex modalities. The unimodal posteriors are conditioned on Deep Canonical Correlation Analysis embeddings, which preserve the information shared across modalities and lead to more coherent cross-modal generations. Furthermore, we use Normalizing Flows to enrich the unimodal posteriors and achieve more diverse data generation. Finally, we propose to use a Product of Experts for inferring one modality from several others, which makes the model scalable to any number of modalities. We demonstrate on several datasets that our method improves likelihood estimates, the diversity of the generations, and, in particular, coherence metrics in the conditional generations.
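To illustrate why a Product of Experts scales to any number of conditioning modalities, the sketch below combines per-modality experts in closed form. It assumes each expert is a diagonal Gaussian, which is a simplification made only for this illustration: the flow-enriched posteriors described above are generally not Gaussian, and the abstract does not say how their product is evaluated or sampled. The function name `product_of_gaussian_experts` and its arguments are hypothetical.

```python
import math


def product_of_gaussian_experts(means, logvars):
    """Combine diagonal Gaussian experts N(mean_i, exp(logvar_i)) into one Gaussian.

    Illustrative assumption: every expert is a diagonal Gaussian, so the
    normalized product is again a diagonal Gaussian whose precision is the sum
    of the expert precisions and whose mean is the precision-weighted average.

    means, logvars: lists of equal-length lists, one inner list per expert.
    Returns (joint_mean, joint_logvar) as lists of floats.
    """
    latent_dim = len(means[0])
    joint_mean, joint_logvar = [], []
    for d in range(latent_dim):
        # Precision (inverse variance) of each expert along dimension d.
        precisions = [math.exp(-lv[d]) for lv in logvars]
        total_precision = sum(precisions)
        # Precision-weighted sum of the expert means along dimension d.
        weighted = sum(p * m[d] for p, m in zip(precisions, means))
        joint_mean.append(weighted / total_precision)
        joint_logvar.append(-math.log(total_precision))
    return joint_mean, joint_logvar


# Two unit-variance experts (e.g. two observed modalities) that disagree on the mean.
mu, logvar = product_of_gaussian_experts(
    means=[[0.0, 0.0], [2.0, 2.0]],
    logvars=[[0.0, 0.0], [0.0, 0.0]],
)
```

Adding a further conditioning modality only appends one more row to `means` and `logvars`; no additional inference network is needed for each subset of modalities, which is the scalability argument in this simplified setting.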