The study of human emotions, traditionally a cornerstone of fields such as psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial to understanding human emotions. However, AI's progress in multimodal emotion recognition (MER) faces substantial technical challenges. One significant hurdle is how AI models handle the absence of a particular modality, a frequent occurrence in real-world situations. This study's central focus is assessing the performance and resilience of two strategies when confronted with a missing modality: a novel dynamic modality and view selection method, and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER; in the missing-modality scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between the audio and video modalities in emotion prediction and showcasing the adaptability of dynamic selection methods in handling missing modalities.
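To make the cross-attention strategy mentioned above concrete, the sketch below shows plain scaled dot-product cross-attention in which embeddings from one modality (e.g. audio frames) act as queries over key/value embeddings from the other (e.g. video frames). This is a minimal, dependency-free illustration of the general mechanism, not the paper's actual architecture; all variable names and the toy dimensions are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query vector (e.g. an audio
    frame embedding) attends over key/value vectors from the other modality
    (e.g. video frame embeddings). Returns one fused vector per query."""
    d = len(keys[0])  # key dimensionality, used for the 1/sqrt(d) scaling
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, dimension by dimension.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Toy example (hypothetical numbers): 2 audio queries attend over
# 3 video key/value pairs, yielding 2 fused 2-d vectors.
audio_q = [[1.0, 0.0], [0.0, 1.0]]
video_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
video_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused = cross_attention(audio_q, video_k, video_v)
```

Because the attention weights form a convex combination, each fused vector stays within the range spanned by the video value vectors; in a missing-modality setting, a model can fall back to the surviving modality's own (self-attended) features when the other stream's keys and values are unavailable.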