Short videos (SVs) have become essential to how we acquire and share information. Their prevalent use for spreading emotions makes emotion recognition in SVs necessary. Given the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos. We alleviate the impact of subjectivity on labeling quality by emphasizing better personnel allocation and multi-stage annotation. In addition, we provide category-balanced and test-oriented variants through targeted data sampling. Some common video types (e.g., facial expressions and postures) have been well studied. However, understanding the emotions in SVs remains challenging for two reasons: the greater content diversity brings more distinct semantic gaps and makes emotion-related features harder to learn, and information gaps arise from emotion incompleteness under the prevalent audio-visual co-expression. To tackle these problems, we present an end-to-end baseline method, AV-CPNet, which employs a video transformer to better learn semantically relevant representations. We further design a two-stage cross-modal fusion module to complementarily model the correlations of audio-visual features. The EP-CE Loss, which incorporates three emotion polarities, is then applied to guide model optimization. Extensive experimental results on nine datasets verify the effectiveness of AV-CPNet. Datasets and code will be released at https://github.com/XuecWu/eMotions.