This paper introduces a novel approach to Social Group Activity Recognition (SoGAR) using a self-supervised Transformer network that can effectively utilize unlabeled video data. To extract spatio-temporal information, we create local and global views with varying frame rates. Our self-supervised objective ensures that features extracted from contrasting views of the same video are consistent across spatio-temporal domains. The proposed approach makes efficient use of transformer-based encoders to address the challenges of the weakly supervised setting of group activity recognition. By leveraging transformer models, our approach can model long-term relationships along the spatio-temporal dimensions. SoGAR achieves state-of-the-art results on three group activity recognition benchmarks, namely the JRDB-PAR, NBA, and Volleyball datasets, surpassing previously reported results in terms of F1-score, MCA, and MPCA.
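The view construction and consistency objective described in the abstract can be illustrated with a minimal sketch. It covers only the temporal side, and everything specific in it is an assumption rather than the paper's implementation: the function names `sample_view` and `consistency_loss`, the clip length and strides, the reading of a "global" view as a long span at a low frame rate and a "local" view as a short span at a higher frame rate, and the cosine-based loss, which is a simple stand-in because the abstract does not state the actual objective.

```python
import numpy as np

def sample_view(num_frames, clip_len, stride, rng):
    """Pick clip_len frame indices spaced `stride` apart at a random offset."""
    span = (clip_len - 1) * stride + 1  # frames covered by the clip
    if span > num_frames:
        raise ValueError("video too short for this clip_len/stride")
    start = int(rng.integers(0, num_frames - span + 1))
    return start + stride * np.arange(clip_len)

def consistency_loss(feat_a, feat_b):
    """1 - cosine similarity: 0 when the two feature vectors are aligned."""
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return float(1.0 - a @ b)

rng = np.random.default_rng(0)

# Assumed settings: a 64-frame video, 8 frames per view.
# Global view: large stride (low frame rate), covers most of the video.
global_view = sample_view(num_frames=64, clip_len=8, stride=8, rng=rng)
# Local view: small stride (higher frame rate), covers a short segment.
local_view = sample_view(num_frames=64, clip_len=8, stride=2, rng=rng)

# Placeholder features standing in for encoder outputs of the two views.
feat_global = rng.normal(size=128)
feat_local = rng.normal(size=128)
loss = consistency_loss(feat_global, feat_local)
```

In a real pipeline the two index arrays would select frames that are passed through the transformer encoders, and the loss would be minimized so that features from both views of the same video agree.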