Analyzing Multimodal Integration in the Variational Autoencoder from an Information-Theoretic Perspective

Human perception is inherently multimodal. We integrate, for instance, visual, proprioceptive, and tactile information into a single experience. Multimodal learning is therefore important for building robotic systems that aim to interact robustly with the real world. One model that has been proposed for multimodal integration is the multimodal variational autoencoder. A variational autoencoder (VAE) consists of two networks: an encoder that maps the data to a stochastic latent space and a decoder that reconstructs the data from an element of this latent space. The multimodal VAE integrates inputs from different modalities at two points in time in the latent space and can thereby be used as a controller for a robotic agent. Here we use this architecture and introduce information-theoretic measures to analyze how important the integration of the different modalities is for the reconstruction of the input data. We calculate two types of measures. The first, called single modality error, assesses how important the information from a single modality is for the reconstruction of this modality or of all modalities. The second, called loss of precision, quantifies the impact that missing information from only one modality has on the reconstruction of this modality or of the whole input vector. The VAE is trained via the evidence lower bound, which can be written as the sum of two terms, the reconstruction loss and the latent loss. The impact of the latent loss can be weighted via an additional variable, which was introduced to combat posterior collapse. Here we train networks with four different weighting schedules and analyze them with respect to their capabilities for multimodal integration.
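
The weighted training objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a diagonal Gaussian posterior with a standard normal prior (so the latent loss has a closed-form KL divergence) and a squared-error reconstruction loss. The names `gaussian_kl`, `weighted_negative_elbo`, `linear_warmup`, and the weight `beta` are hypothetical. The linear warm-up is only one plausible weighting schedule; the abstract does not specify the four schedules used.

```python
import numpy as np


def gaussian_kl(mu, log_var):
    """Latent loss: KL(N(mu, diag(exp(log_var))) || N(0, I)) per sample.

    Assumes a diagonal Gaussian posterior and a standard normal prior.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def weighted_negative_elbo(x, x_recon, mu, log_var, beta):
    """Batch mean of reconstruction loss plus beta times latent loss.

    The squared-error reconstruction term is an assumption; beta is the
    additional variable that weights the latent loss.
    """
    reconstruction_loss = np.sum((x - x_recon) ** 2, axis=-1)
    latent_loss = gaussian_kl(mu, log_var)
    return float(np.mean(reconstruction_loss + beta * latent_loss))


def linear_warmup(step, warmup_steps):
    """Hypothetical schedule: beta rises linearly from 0 to 1, then stays at 1."""
    return min(1.0, step / warmup_steps)


# Toy batch with one sample, two input dimensions, two latent dimensions.
x = np.array([[1.0, 2.0]])
x_recon = np.array([[1.0, 1.0]])
mu = np.array([[1.0, 0.0]])
log_var = np.array([[0.0, 0.0]])

beta = linear_warmup(step=5, warmup_steps=10)
loss = weighted_negative_elbo(x, x_recon, mu, log_var, beta)
```

With `beta = 0` the objective reduces to the reconstruction loss alone, and with `beta = 1` it is the negative evidence lower bound under the stated assumptions. A schedule that keeps `beta` small early in training lets the decoder learn to use the latent code before the latent loss is fully enforced, which is the usual motivation for such weighting when posterior collapse is a concern.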
