Group Activity Recognition (GAR) aims to predict the activity category of a group by learning the spatial-temporal interaction relations among its actors. An effective actor relation learning method is therefore crucial for GAR. Previous works mainly learn the interaction relations with carefully designed GCNs or Transformers: to infer the actor interaction relations, GCNs need a learnable adjacency matrix, and Transformers need to compute self-attention. Although these methods model the interaction relations effectively, they also increase model complexity (the number of parameters and computations). In this paper, we design a novel MLP-based method for Actor Interaction Relation learning (MLP-AIR) in GAR. Compared with GCNs and Transformers, our method is a competitive but conceptually and technically simpler alternative that significantly reduces complexity. Specifically, MLP-AIR includes three sub-modules: an MLP-based Spatial relation modeling module (MLP-S), an MLP-based Temporal relation modeling module (MLP-T), and an MLP-based Relation refining module (MLP-R). MLP-S models the spatial relations between different actors in each frame. MLP-T models the temporal relations between different frames for each actor. MLP-R further refines the relations between different dimensions of the relation features to improve their expressive ability. To evaluate MLP-AIR, we conduct extensive experiments on two widely used benchmarks, the Volleyball and Collective Activity datasets. Experimental results demonstrate that MLP-AIR achieves competitive results with low complexity.
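The three sub-modules can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes actor features arranged as a tensor of shape (T, N, D) for T frames, N actors, and D feature dimensions, and it assumes each module is a two-layer MLP with a ReLU activation and a residual connection, applied along one axis. The function names, hidden sizes, initialization, and the final mean pooling are all hypothetical choices made for illustration.

```python
import numpy as np


def init_mlp(rng, dim, hidden):
    """Weights for a two-layer MLP mapping dim -> hidden -> dim (sizes are assumptions)."""
    return {
        "w1": rng.normal(0.0, 0.02, size=(dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, size=(hidden, dim)),
        "b2": np.zeros(dim),
    }


def mlp(params, x):
    """Apply the MLP to the last axis of x (ReLU is an assumed activation)."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]


def mix_along(params, x, axis):
    """Mix information along one axis of x with an MLP, plus an assumed residual."""
    moved = np.moveaxis(x, axis, -1)         # bring the target axis last
    mixed = mlp(params, moved)               # MLP acts on the last axis only
    return x + np.moveaxis(mixed, -1, axis)  # restore layout and add the input


def mlp_air_block(params, x):
    """x has shape (T, N, D): frames, actors, feature dimensions."""
    x = mix_along(params["spatial"], x, axis=1)   # MLP-S: across actors, per frame
    x = mix_along(params["temporal"], x, axis=0)  # MLP-T: across frames, per actor
    x = mix_along(params["refine"], x, axis=2)    # MLP-R: across feature dimensions
    return x


rng = np.random.default_rng(0)
T, N, D = 3, 12, 16  # illustrative sizes, not taken from the paper
params = {
    "spatial": init_mlp(rng, N, 2 * N),
    "temporal": init_mlp(rng, T, 2 * T),
    "refine": init_mlp(rng, D, 2 * D),
}
features = rng.normal(size=(T, N, D))
out = mlp_air_block(params, features)
group_feature = out.mean(axis=(0, 1))  # pooled over frames and actors (assumed)
```

In this sketch the relation modeling uses only fixed-size weight matrices along the actor, frame, and feature axes. No adjacency matrix is learned and no pairwise attention scores are computed, which is the kind of simplification the abstract describes.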