In this paper, we propose Self-supervised Spatio-temporal Transformers (SPARTAN), a new, simple, and effective approach to Group Activity Recognition (GAR) that learns from unlabeled video data. Given a video, we create local and global spatio-temporal views with varying spatial patch sizes and frame rates. The proposed self-supervised objective matches the features of these contrasting views of the same video, so that the learned representation stays consistent under variations in the spatio-temporal domain. To the best of our knowledge, the proposed mechanism is one of the first to address the weakly supervised setting of GAR using the encoders of video transformers. Furthermore, by building on transformer models, our approach supports long-term relationship modeling along the spatio-temporal dimensions. SPARTAN performs well on two group activity recognition benchmarks, the NBA and Volleyball datasets, surpassing state-of-the-art results by a significant margin in terms of the MCA and MPCA metrics.
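The view construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `ViewSpec` and `sample_view` helpers, the frame counts, crop fractions, and strides, and the use of crop size as a stand-in for "spatial patch size" are all assumptions made for illustration. The abstract does not specify the loss, so `consistency_loss` uses a plain mean squared distance as a placeholder for the feature-matching objective.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewSpec:
    """One spatio-temporal view cut from a video (hypothetical helper)."""
    kind: str          # "global" or "local"
    frame_ids: tuple   # indices of the sampled frames
    box: tuple         # (top, left, height, width) of the spatial crop


def sample_view(num_frames, height, width, kind, rng):
    """Sample a view; every size and stride here is an illustrative assumption."""
    if kind == "global":
        n_sampled, crop_frac, strides = 8, 0.9, (1, 2)
    else:
        n_sampled, crop_frac, strides = 4, 0.4, (2, 4)
    stride = rng.choice(strides)              # varying frame rate
    span = (n_sampled - 1) * stride + 1       # number of source frames covered
    if span > num_frames:
        raise ValueError("video too short for the requested view")
    start = rng.randrange(num_frames - span + 1)
    frame_ids = tuple(start + i * stride for i in range(n_sampled))
    h, w = int(height * crop_frac), int(width * crop_frac)
    top = rng.randrange(height - h + 1)       # random crop position
    left = rng.randrange(width - w + 1)
    return ViewSpec(kind, frame_ids, (top, left, h, w))


def consistency_loss(features_a, features_b):
    """Mean squared distance between two feature vectors (assumed placeholder loss)."""
    if len(features_a) != len(features_b):
        raise ValueError("feature vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(features_a, features_b)) / len(features_a)


# Two global and four local views of a 64-frame, 224x224 video.
rng = random.Random(0)
views = [sample_view(64, 224, 224, "global", rng) for _ in range(2)]
views += [sample_view(64, 224, 224, "local", rng) for _ in range(4)]
```

In a training loop, each view would be passed through the video transformer encoder, and the loss would be applied to pairs of resulting features from the same video, pulling local and global views toward a shared representation.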