Weakly-Supervised Group Activity Recognition (WSGAR) aims to recognize the activity that a group of individuals performs together, using only video-level labels and no actor-level labels. We propose the Flow-Assisted Motion Learning Network (Flaming-Net) for WSGAR, which consists of a motion-aware actor encoder that extracts actor features and a two-pathway relation module that infers the interactions among actors and their group activity. Flaming-Net leverages optical flow as an additional modality during training to enhance its motion awareness when locating locally active actors. The first pathway of the relation module, the actor-centric path, first captures the temporal dynamics of individual actors and then constructs inter-actor relationships. In parallel, the group-centric path first builds spatial connections between actors within the same frame and then captures the spatio-temporal dynamics among them simultaneously. We demonstrate that Flaming-Net achieves new state-of-the-art WSGAR results on two benchmarks, including an MPCA score 2.8 percentage points higher on the NBA dataset. Importantly, optical flow is used only for training, not for inference.
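The ordering of operations in the two pathways can be illustrated with a small sketch. This is not the Flaming-Net implementation: the abstract does not specify the layers, so the sketch assumes plain, parameter-free scaled dot-product self-attention, mean pooling, and fusion by concatenation, and all function names (`self_attention`, `actor_centric_path`, `group_centric_path`, `relation_module`) are hypothetical. Actor features are assumed to arrive as an array of shape (T, N, D) for T frames, N actors, and D feature dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Assumption: parameter-free scaled dot-product self-attention.
    # tokens: (..., L, D); attention runs over the L axis.
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def actor_centric_path(x):
    # x: (T, N, D). Step 1: temporal dynamics of each actor separately.
    per_actor = np.swapaxes(x, 0, 1)        # (N, T, D)
    temporal = self_attention(per_actor)    # attention over T, per actor
    actor_tokens = temporal.mean(axis=1)    # (N, D), one token per actor
    # Step 2: relationships between actors.
    related = self_attention(actor_tokens)  # attention over N
    return related.mean(axis=0)             # (D,)

def group_centric_path(x):
    # x: (T, N, D). Step 1: spatial connections within each frame.
    t, n, d = x.shape
    spatial = self_attention(x)             # attention over N, per frame
    # Step 2: joint attention over all (frame, actor) tokens at once.
    joint = self_attention(spatial.reshape(t * n, d))
    return joint.mean(axis=0)               # (D,)

def relation_module(x):
    # Assumption: the two pathway outputs are fused by concatenation.
    return np.concatenate([actor_centric_path(x), group_centric_path(x)])

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3, 8))       # T=4 frames, N=3 actors, D=8
group_feature = relation_module(features)   # shape (16,)
```

The point of the sketch is the contrast in ordering: the actor-centric path attends over time before attending over actors, while the group-centric path attends over actors within a frame before attending over all frame-actor tokens together. A real model would add learned projections, positional information, and a classifier on top of the fused feature.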