Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and that can be used interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is difficult because of the inherent challenges in accurately interpreting and integrating varied data sources. It is also challenging to robustly handle missing or partial information while allowing a direct switch between regression and classification tasks. This study proposes a \emph{versatile audio-visual learning} (VAVL) framework for handling unimodal and multimodal systems for emotion regression and emotion classification tasks. We implement an audio-visual framework that can be trained even when paired audio and visual data is not available for part of the training set (i.e., only audio or only video is present). We achieve this effective representation learning with audio-visual shared layers, residual connections over the shared layers, and a unimodal reconstruction task. Our experimental results reveal that our architecture significantly outperforms strong baselines on both the CREMA-D and MSP-IMPROV corpora. Notably, VAVL attains a new state-of-the-art performance in the emotional attribute prediction task on the MSP-IMPROV corpus. Code is available at: https://github.com/ilucasgoncalves/VAVL
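As a rough illustration of the mechanism summarized in the abstract (layers shared across modalities, a residual connection over the shared layer, a unimodal reconstruction objective, and a switch between regression and classification), the following toy Python sketch may help. It is not the authors' implementation; the repository linked above is the reference for that. The class name `VersatileSketch`, the layer sizes, the three-attribute regression head, the four-class head, the mean fusion of available modalities, and the choice of reconstructing the unimodal representation are all assumptions made for illustration.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def linear(x, layer):
    """Apply a dense layer to a feature list."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def tanh(v):
    return [math.tanh(a) for a in v]


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


class VersatileSketch:
    """Hypothetical toy model; every size and name here is an assumption."""

    def __init__(self, dims, hidden=8, n_attributes=3, n_classes=4, seed=0):
        rng = random.Random(seed)
        # One encoder and one reconstruction head per modality.
        self.encoders = {m: make_linear(d, hidden, rng) for m, d in dims.items()}
        self.decoders = {m: make_linear(hidden, hidden, rng) for m in dims}
        # A single layer shared by every modality.
        self.shared = make_linear(hidden, hidden, rng)
        # Interchangeable task heads on top of the same representation.
        self.heads = {
            "regression": make_linear(hidden, n_attributes, rng),
            "classification": make_linear(hidden, n_classes, rng),
        }

    def forward(self, inputs, task):
        """inputs maps modality name -> features; missing modalities are omitted."""
        if not inputs:
            raise ValueError("at least one modality is required")
        shared_out, recon_loss = [], 0.0
        for name, x in inputs.items():
            z = tanh(linear(x, self.encoders[name]))   # unimodal representation
            s = tanh(linear(z, self.shared))           # shared layer
            h = [zi + si for zi, si in zip(z, s)]      # residual over the shared layer
            r = linear(h, self.decoders[name])         # reconstruct z (assumed target)
            recon_loss += sum((ri - zi) ** 2 for ri, zi in zip(r, z)) / len(z)
            shared_out.append(h)
        # Average over whichever modalities are present (assumed fusion rule).
        fused = [sum(col) / len(shared_out) for col in zip(*shared_out)]
        out = linear(fused, self.heads[task])
        if task == "classification":
            out = softmax(out)
        return out, recon_loss


model = VersatileSketch({"audio": 5, "visual": 6})
# Audio-only input, attribute regression.
attributes, _ = model.forward({"audio": [0.1] * 5}, "regression")
# Both modalities, categorical classification.
probs, recon = model.forward({"audio": [0.1] * 5, "visual": [0.2] * 6}, "classification")
```

In this sketch the same object accepts audio only, video only, or both, and the `task` argument selects the output head, which mirrors the flexibility the abstract describes. In a real training setup the reconstruction term would presumably be added to the task loss with some weighting, which the abstract does not specify.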