Multimodal emotion recognition (MER) in practical scenarios is significantly challenged by missing or incomplete data across modalities. To overcome this, researchers have simulated incomplete conditions during training to improve the system's overall robustness. Traditional methods typically discard data or replace data segments with zero vectors to approximate such incompleteness. However, these approaches neither accurately represent real-world conditions nor adequately handle data that is noisy but still available. For instance, a blurry image still retains information, so it cannot simply be replaced with a zero vector. To tackle this issue and develop a more precise MER system, we introduce a novel noise-robust MER model that learns robust multimodal joint representations from noisy data. The approach has two pivotal components: first, a noise scheduler that adjusts the type and level of noise in the data to emulate various realistic incomplete situations; second, a Variational AutoEncoder (VAE)-based module that reconstructs these robust multimodal joint representations from the noisy inputs. Notably, the noise scheduler enables the exploration of an entirely new type of incomplete data condition that existing methods cannot simulate. Extensive experiments on the benchmark datasets IEMOCAP and CMU-MOSEI demonstrate the effectiveness of the noise scheduler and the excellent performance of our proposed model.
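The abstract does not say how the noise scheduler is implemented. As a rough illustration only, the sketch below shows one way a scheduler could vary both the type and the level of noise applied to a feature vector during training. Everything in it is an assumption made for illustration and not the authors' design: the names `NoiseScheduler`, `gaussian_noise`, `impulse_noise`, and `zero_mask`, the linear ramp of the noise level, and the uniform choice among noise types. The zero-masking function is included only to contrast with the traditional zero-vector substitution mentioned above.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]
NoiseFn = Callable[[Vector, float, random.Random], Vector]


def gaussian_noise(x: Vector, level: float, rng: random.Random) -> Vector:
    """Add zero-mean Gaussian noise with standard deviation `level`."""
    return [v + rng.gauss(0.0, level) for v in x]


def impulse_noise(x: Vector, level: float, rng: random.Random) -> Vector:
    """With probability `level`, replace an entry by a random value in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) if rng.random() < level else v for v in x]


def zero_mask(x: Vector, level: float, rng: random.Random) -> Vector:
    """Baseline for comparison: with probability `level`, zero out an entry."""
    return [0.0 if rng.random() < level else v for v in x]


# Hypothetical catalogue of noise types the scheduler can choose from.
NOISE_TYPES: Dict[str, NoiseFn] = {
    "gaussian": gaussian_noise,
    "impulse": impulse_noise,
    "zero_mask": zero_mask,
}


class NoiseScheduler:
    """Hypothetical scheduler: picks a noise type and a step-dependent level."""

    def __init__(self, max_level: float = 0.5, total_steps: int = 1000, seed: int = 0):
        if total_steps <= 0:
            raise ValueError("total_steps must be positive")
        self.max_level = max_level
        self.total_steps = total_steps
        self.rng = random.Random(seed)

    def level_at(self, step: int) -> float:
        """Assumed schedule: linear ramp from 0 to max_level over total_steps."""
        clamped = min(max(step, 0), self.total_steps)
        return self.max_level * clamped / self.total_steps

    def corrupt(self, x: Vector, step: int) -> Tuple[str, Vector]:
        """Sample a noise type uniformly and apply it at the scheduled level."""
        name = self.rng.choice(sorted(NOISE_TYPES))
        noisy = NOISE_TYPES[name](x, self.level_at(step), self.rng)
        return name, noisy


if __name__ == "__main__":
    scheduler = NoiseScheduler(max_level=0.5, total_steps=1000, seed=42)
    features = [0.2, -0.1, 0.7, 0.0]
    for step in (0, 500, 1000):
        kind, noisy = scheduler.corrupt(features, step)
        print(step, kind, [round(v, 3) for v in noisy])
```

In a training loop, the corrupted vector would be fed to the reconstruction module while the clean vector serves as the target. Under this assumed schedule the input is left untouched at step 0 and is corrupted most strongly at the final step.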