In video-based emotion recognition (ER), it is important to effectively leverage the complementary relationship between the audio (A) and visual (V) modalities while retaining the intra-modal characteristics of each modality. In this paper, a recursive joint attention model is proposed along with long short-term memory (LSTM) modules for the fusion of vocal and facial expressions in regression-based ER. Specifically, we investigated the possibility of exploiting the complementary nature of the A and V modalities using a joint cross-attention model applied recursively, with LSTMs capturing the temporal dependencies both within each modality and across the joint A-V feature representations. By integrating LSTMs with recursive joint cross-attention, our model can efficiently leverage both intra- and inter-modal relationships for the fusion of the A and V modalities. The results of extensive experiments performed on the challenging Affwild2 and Fatigue (private) datasets indicate that the proposed A-V fusion model can significantly outperform state-of-the-art methods.