Human emotions entail a complex set of behavioral, physiological, and cognitive changes. Current state-of-the-art models fuse the behavioral and physiological components using classic machine learning rather than recent deep learning techniques. We propose to fill this gap by designing the Multimodal for Video and Physio (MVP) architecture, streamlined to fuse video and physiological signals. Unlike other approaches, MVP exploits the benefits of attention to enable the use of long input sequences (1-2 minutes). We have studied video and physiological backbones for processing long input sequences and evaluated our method against the state-of-the-art. Our results show that MVP outperforms former methods for emotion recognition based on facial videos, EDA, and ECG/PPG.