Previous group activity recognition approaches were limited to reasoning over human relations or finding important subgroups, and tended to ignore two indispensable cues: group composition and human-object interaction. This omission yields only a partial interpretation of the scene and increases the interference of irrelevant actions on the results. We therefore propose DynamicFormer, which uses a Dynamic composition Module (DcM) to model the relations and locations of persons and a Dynamic interaction Module (DiM) to discriminate the contributions of participants. Our core idea is inspired by our findings on group composition and human-object interaction: group composition describes where people are and how they relate to one another inside the group, while interaction reflects the relations between humans and objects outside the group. DcM uses spatial and temporal encoders to model dynamic composition, and DiM explores interaction with a novel GCN that contains a transformer to account for the temporal neighbors of each human or object. In addition, a Multi-level Dynamic Integration is employed to integrate features from different levels. Extensive experiments on two public datasets show that our method achieves state-of-the-art performance.
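To make the idea of a graph convolution with temporal attention inside it more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each node (a person or an object) carries one feature vector per frame, that the attention uses identity projections instead of learned weights, and that the adjacency matrix is given. The names `temporal_attention` and `graph_step` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def temporal_attention(track):
    """Scaled dot-product self-attention over one node's T frames.

    track: list of T feature vectors (lists of floats).
    Assumption: queries, keys and values are the raw features
    (no learned projections), purely for illustration.
    """
    d = len(track[0])
    out = []
    for q in track:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in track]
        w = softmax(scores)
        out.append([sum(w[t] * track[t][i] for t in range(len(track)))
                    for i in range(d)])
    return out


def graph_step(feats, adj):
    """One illustrative graph update over persons and objects.

    feats: N x T x D nested lists (N nodes, T frames, D features).
    adj:   N x N non-negative edge weights (assumed given).
    Each node first attends over its own temporal neighbors, then
    aggregates its graph neighbors per frame with row normalization.
    """
    n = len(feats)
    t_len = len(feats[0])
    d = len(feats[0][0])
    attended = [temporal_attention(track) for track in feats]
    out = []
    for i in range(n):
        row_sum = sum(adj[i]) or 1.0  # avoid division by zero for isolated nodes
        node = []
        for t in range(t_len):
            node.append([
                sum(adj[i][j] * attended[j][t][k] for j in range(n)) / row_sum
                for k in range(d)
            ])
        out.append(node)
    return out
```

A trained model would replace the identity projections with learned query, key and value matrices and add a learned weight to the aggregation step; the sketch only shows the order of operations, temporal attention per node followed by neighbor aggregation.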