Multimodal Variational Autoencoders (VAEs) have been the subject of intense research in recent years because they can integrate multiple modalities into a joint representation and can thus serve as a promising tool for both data classification and generation. Several approaches to multimodal VAE learning have been proposed so far, but their comparison and evaluation have been rather inconsistent. One reason is that the models differ at the implementation level; another is that the datasets commonly used for this purpose were not originally designed to evaluate multimodal generative models. This paper addresses both issues. First, we propose a toolkit for systematic multimodal VAE training and comparison. The toolkit currently comprises 4 existing multimodal VAEs and 6 commonly used benchmark datasets, along with instructions on how to add a new model or dataset. Second, we present a disentangled bimodal dataset designed to comprehensively evaluate joint generation and cross-generation capabilities across multiple difficulty levels. We demonstrate the utility of our dataset by comparing the implemented state-of-the-art models.