In this paper, we present a multimodal and dynamical variational autoencoder (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. In the second stage, the MDVAE model is learned on the intermediate representation of the VQ-VAEs before quantization. The disentanglement of static from dynamical information, and of modality-specific from modality-common information, occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include audiovisual speech manipulation, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with little labeled data, and that it achieves higher accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
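The latent structure described above can be illustrated with a minimal shape-level sketch. It is not the authors' implementation: the names `sample_latents` and `decoder_inputs`, the dimensions, and the choice to feed each modality's decoder the static, shared, and modality-specific variables concatenated frame by frame are all assumptions made for illustration. Random values stand in for the output of a trained encoder.

```python
import numpy as np


def sample_latents(T, d_static=4, d_shared=3, d_audio=2, d_visual=2, seed=0):
    """Draw placeholder latents for one sequence of T frames (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d_static)          # static: one vector per sequence
    z_av = rng.standard_normal((T, d_shared))  # dynamical, shared by both modalities
    z_a = rng.standard_normal((T, d_audio))    # dynamical, audio-specific
    z_v = rng.standard_normal((T, d_visual))   # dynamical, visual-specific
    return w, z_av, z_a, z_v


def decoder_inputs(w, z_av, z_a, z_v):
    """Build per-frame decoder inputs (assumed layout: static, shared, specific)."""
    T = z_av.shape[0]
    # Repeat the static vector along the time axis so every frame sees it.
    w_rep = np.broadcast_to(w, (T, w.shape[0]))
    audio_in = np.concatenate([w_rep, z_av, z_a], axis=1)
    visual_in = np.concatenate([w_rep, z_av, z_v], axis=1)
    return audio_in, visual_in


w, z_av, z_a, z_v = sample_latents(T=5)
audio_in, visual_in = decoder_inputs(w, z_av, z_a, z_v)
print(audio_in.shape, visual_in.shape)
```

In this sketch the static and shared columns are identical in both decoder inputs, and only the trailing modality-specific columns differ. That mirrors the split into modality-common and modality-specific factors, with the static part constant across frames.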