This work addresses Social Activity Recognition (SAR), a critical component of real-world tasks such as surveillance and assistive robotics. Unlike traditional event understanding approaches, SAR requires modeling individual actors' appearance and motion and contextualizing them within their social interactions. Traditional action localization methods fall short because of their single-actor, single-action assumption. Prior SAR research has relied heavily on densely annotated data, but privacy concerns limit its applicability in real-world settings. In this work, we propose a self-supervised approach based on multi-actor predictive learning for SAR in streaming videos. Using a visual-semantic graph structure, we model social interactions, enabling relational reasoning for robust performance with minimal labeled data. The proposed framework achieves competitive results on standard group activity recognition benchmarks. Evaluation on three publicly available action localization benchmarks demonstrates its generalizability to arbitrary action localization.