Video-based Emotional Reaction Intensity (ERI) estimation measures the intensity of a subject's reactions to stimuli along several emotional dimensions, using video of the subject recorded as they view the stimuli. We propose a multi-modal architecture for video-based ERI estimation that combines video and audio information. Video input is first encoded spatially, frame by frame, by combining features that capture holistic aspects of the subject's facial expression with features that capture spatially localized aspects. The per-frame features are then integrated across time: first from frame to frame by gated recurrent units (GRUs), then globally by a transformer. We handle variable video length with a regression token that accumulates information from all frames into a fixed-dimensional vector independent of video length. Audio information is handled similarly: spectral information extracted within each frame is integrated across time by a cascade of GRUs and a transformer with a regression token. The outputs of the video and audio regression tokens are merged by concatenation and passed to a final fully connected layer that produces the intensity estimates. Our architecture achieved excellent performance on the Hume-Reaction dataset in the ERI Estimation Challenge of the Fifth Competition on Affective Behavior Analysis in-the-Wild (ABAW5). The Pearson correlation coefficients between estimated and subject self-reported scores, averaged across all emotions, were 0.455 on the validation set and 0.4547 on the test set, well above the baselines. The transformer's self-attention mechanism enables our architecture to focus on the most critical video frames regardless of video length. Ablation experiments establish the advantages of combining holistic and local features and of multi-modal integration. Code is available at https://github.com/HKUST-NISL/ABAW5.
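The temporal and fusion stages described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class names `TemporalBranch` and `ERIRegressor`, the hidden size, the number of heads and layers, and the default of seven output emotions are all assumptions made for the example. The per-frame spatial encoder (holistic plus local features) and the audio spectral front end are assumed to have been run already, so the inputs are zero-padded feature sequences with their valid lengths.

```python
import torch
import torch.nn as nn


class TemporalBranch(nn.Module):
    """GRU over frames, then a transformer encoder with a learnable regression token.

    Hypothetical module for illustration; sizes are assumptions, not the paper's values.
    """

    def __init__(self, in_dim, hidden_dim=128, num_heads=4, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden_dim, batch_first=True)
        # Learnable token prepended to every sequence; its output summarizes the clip.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,
            dim_feedforward=2 * hidden_dim,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x, lengths):
        # x: (batch, time, in_dim), zero-padded; lengths: (batch,) valid frame counts.
        h, _ = self.gru(x)  # frame-to-frame integration
        batch, time, _ = h.shape
        tokens = torch.cat([self.reg_token.expand(batch, -1, -1), h], dim=1)
        # True marks padded positions that attention must ignore.
        pad = torch.arange(time, device=x.device)[None, :] >= lengths[:, None]
        keep_token = torch.zeros(batch, 1, dtype=torch.bool, device=x.device)
        mask = torch.cat([keep_token, pad], dim=1)
        out = self.encoder(tokens, src_key_padding_mask=mask)  # global integration
        return out[:, 0]  # fixed-size vector, independent of sequence length


class ERIRegressor(nn.Module):
    """Concatenate the video and audio regression-token outputs, then one linear layer."""

    def __init__(self, video_dim, audio_dim, hidden_dim=128, num_emotions=7):
        super().__init__()
        self.video = TemporalBranch(video_dim, hidden_dim)
        self.audio = TemporalBranch(audio_dim, hidden_dim)
        self.head = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, video, video_len, audio, audio_len):
        v = self.video(video, video_len)
        a = self.audio(audio, audio_len)
        return self.head(torch.cat([v, a], dim=-1))
```

Because padded frames are masked out of self-attention, the regression token's output depends only on the valid frames, which is what lets a single fixed-size vector represent clips of different lengths.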
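The evaluation metric quoted above, the Pearson correlation coefficient averaged across emotions, can be illustrated with a short pure-Python sketch. The function names `pearson` and `mean_pearson` and the samples-by-emotions list layout are assumptions made for this example; the challenge's official scoring code may differ in detail.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pearson(preds, targets):
    """Average the per-emotion Pearson correlation.

    preds, targets: lists of rows, one row per sample, one column per emotion.
    """
    pred_cols = list(zip(*preds))      # one tuple of predictions per emotion
    target_cols = list(zip(*targets))  # one tuple of self-reported scores per emotion
    scores = [pearson(p, t) for p, t in zip(pred_cols, target_cols)]
    return sum(scores) / len(scores)
```

The correlation is computed separately for each emotion over all samples and only then averaged, so an emotion that is predicted poorly lowers the score even if the others are predicted well.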