Nowadays, short-form videos (SVs) are essential to information acquisition and sharing in our daily life. The prevailing use of SVs to convey emotions makes video emotion analysis (VEA) for SVs increasingly necessary. Considering the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos. Meanwhile, we alleviate the impact of subjectivity on labeling quality through careful personnel allocation and multi-stage annotation. In addition, we provide category-balanced and test-oriented variants through targeted data sampling. Although some commonly studied video types, such as facial-expression videos, have been well explored, analyzing the emotions in SVs remains challenging: their broader content diversity creates wider semantic gaps and makes emotion-related features harder to learn, and the emotion inconsistency that arises under prevalent audio-visual co-expression introduces local biases and collective information gaps. To tackle these challenges, we present AV-CANet, an end-to-end audio-visual baseline that employs a video transformer to better learn semantically relevant representations. We further design the Local-Global Fusion Module to progressively capture the correlations between audio and visual features, and introduce the EP-CE Loss to guide model optimization. Extensive experimental results on seven datasets demonstrate the effectiveness of AV-CANet and provide broad insights for future work. Besides, we investigate the key components of AV-CANet through ablation studies. Datasets and code will be fully open-sourced soon.