In this paper, we present a multimodal and dynamical variational autoencoder (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized variational autoencoder (VQ-VAE) is learned independently for each modality, without temporal modeling. In the second stage, the MDVAE model is learned on the intermediate representations of the VQ-VAEs before quantization. The disentanglement of static from dynamical information, and of modality-specific from modality-common information, occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include audiovisual speech manipulation, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with little labeled data, achieving better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
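The latent structure described above can be pictured with a small bookkeeping sketch. It is only an illustration, not the model itself: the class name, field names, dimensions, and the assumption that each modality's decoder reads the static, shared dynamical, and its own modality-specific dynamical variables are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatentLayout:
    """Hypothetical layout of an MDVAE-style latent space.

    All sizes are illustrative placeholders, not values from the paper.
    """
    static_dim: int       # one vector per sequence, constant over time
    shared_dyn_dim: int   # one vector per frame, shared by audio and video
    audio_dyn_dim: int    # one vector per frame, audio-specific
    visual_dyn_dim: int   # one vector per frame, visual-specific

    def total_size(self, num_frames: int) -> int:
        # The static part is counted once; dynamical parts once per frame.
        per_frame = self.shared_dyn_dim + self.audio_dyn_dim + self.visual_dyn_dim
        return self.static_dim + num_frames * per_frame

    def audio_decoder_inputs(self) -> List[str]:
        # Assumption: audio is reconstructed without visual-specific factors.
        return ["static", "shared_dynamic", "audio_dynamic"]

    def visual_decoder_inputs(self) -> List[str]:
        # Assumption: video is reconstructed without audio-specific factors.
        return ["static", "shared_dynamic", "visual_dynamic"]


layout = LatentLayout(static_dim=8, shared_dyn_dim=4, audio_dyn_dim=2, visual_dyn_dim=2)
print(layout.total_size(num_frames=10))
print(layout.audio_decoder_inputs())
```

Under these assumptions, a task that relies on sequence-level information, such as emotion recognition, would read only the static part, whose size does not grow with the number of frames.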