Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments. Group affect emerges from intertwined human-human interactions, contextual influences, and behavioral cues, which makes its quantitative modeling a hard problem in computational social systems. In in-the-wild scenarios, the difficulty is compounded by the scarcity of large-scale annotated datasets, particularly ones annotated with multimodal and contextual information, and by the inherent complexity of multimodal social interactions shaped by contextual and behavioral variability. To address this gap, we introduce the Group Affect from ViDeos (GAViD) dataset, comprising 5091 video clips with multimodal data (video, audio, and context), annotated with ternary valence and discrete emotion labels and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present the Context-Aware Group Affect Recognition Network (CAGNet) for multimodal, context-aware group affect recognition. CAGNet achieves 63.20% test accuracy on GAViD, comparable to state-of-the-art performance. The dataset and code are available at github.com/deepakkumar-iitr/GAViD.