In this paper, we propose Self-supervised Spatio-temporal Transformers (SPARTAN), a new, simple, and effective approach to Group Activity Recognition (GAR) that learns from unlabeled video data. Given a video, we create local and global spatio-temporal views with varying spatial patch sizes and frame rates. The proposed self-supervised objective matches the features of these contrasting views of the same video, so that the learned representation stays consistent across variations in the spatio-temporal domain. To the best of our knowledge, this is one of the first works to address the weakly supervised setting of GAR using the encoders of video transformers. Furthermore, by building on transformer models, the proposed approach supports long-term relationship modeling along the spatio-temporal dimensions. SPARTAN performs well on two group activity recognition benchmarks, the NBA and Volleyball datasets, surpassing state-of-the-art results by a significant margin in terms of the MCA and MPCA metrics.
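To make the view construction and the matching objective concrete, the following is a minimal sketch and not the implementation used in the paper. It samples a global and a local view by choosing a temporal stride (frame rate) and a spatial crop, then scores how well the features of the views agree. The names `View`, `sample_view`, and `consistency_loss`, the clip lengths, strides, and crop scales, and the use of cosine distance as the matching loss are all illustrative assumptions; the abstract does not specify them.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    kind: str             # "global" or "local"
    frame_indices: tuple  # indices of the frames sampled from the video
    crop: tuple           # (top, left, height, width) in pixels


def sample_view(num_frames, height, width, kind, rng):
    """Sample one view: a temporal stride (frame rate) plus a spatial crop."""
    # Assumed settings: global views cover more frames and a larger area.
    if kind == "global":
        clip_len, strides, scale = 8, (2, 4), (0.6, 1.0)
    else:
        clip_len, strides, scale = 4, (1, 2), (0.2, 0.5)
    stride = rng.choice(strides)
    span = (clip_len - 1) * stride + 1  # frames covered by the clip
    if span > num_frames:
        raise ValueError("video too short for the requested view")
    start = rng.randint(0, num_frames - span)
    frames = tuple(start + i * stride for i in range(clip_len))
    frac = rng.uniform(*scale)  # side length as a fraction of the frame
    h, w = max(1, int(height * frac)), max(1, int(width * frac))
    top = rng.randint(0, height - h)
    left = rng.randint(0, width - w)
    return View(kind, frames, (top, left, h, w))


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return 1.0 - dot / norm


def consistency_loss(global_feats, local_feats):
    """Mean distance over all (global, local) feature pairs of one video.

    Cosine distance is an assumed stand-in for the paper's objective.
    """
    pairs = [(g, l) for g in global_feats for l in local_feats]
    return sum(cosine_distance(g, l) for g, l in pairs) / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_view(64, 224, 224, "global", rng))
    print(sample_view(64, 224, 224, "local", rng))
```

In a full pipeline, each view would be cropped from the video tensor and passed through the video transformer encoder, and the resulting features would be fed to the matching loss; minimizing it pushes the encoder to produce features that agree across spatial scale and frame rate.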