Modelling Spatio-Temporal Interactions for Compositional Action Recognition

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Humans can naturally recognize an action even when the objects involved or the background change. They abstract the action away from the appearance of the objects and their context, a property referred to as the compositionality of actions. Compositional action recognition aims to give action-recognition models this human-like ability to generalize compositionally. Extracting the interactions between humans and objects forms the basis of compositional understanding. These interactions are not affected by the appearance biases of the objects or the context. The context nevertheless provides additional cues about the interactions between things and stuff, so we need to infuse context into the human-object interactions for compositional action recognition. To this end, we first design a spatio-temporal interaction encoder that captures the human-object (things) interactions. The encoder learns spatio-temporal interaction tokens disentangled from the background context. The interaction tokens are then infused with contextual information from the video tokens to model the interactions between things and stuff. The final context-infused spatio-temporal interaction tokens are used for compositional action recognition. We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset, where we obtain a new state-of-the-art result of 83.8% top-1 accuracy, outperforming recent important object-centric methods by a significant margin. Our explicit human-object-stuff interaction modeling is also effective on standard action recognition datasets such as Something-Something-V2 and Epic-Kitchens-100, where we obtain performance comparable to or better than the state of the art.
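
The context-infusion step described above can be illustrated with a small sketch. The abstract does not say which mechanism performs the infusion, so the code below assumes one plausible choice: single-head cross-attention with a residual connection, where the interaction tokens act as queries and the video tokens act as keys and values. The function name `infuse_context`, the absence of learned projections, and the toy token values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infuse_context(interaction_tokens, video_tokens):
    """Residual single-head cross-attention (an assumed mechanism).

    Each interaction token (query) attends over all video tokens
    (keys and values), and the attended context is added back to it.
    Learned projection matrices are omitted to keep the sketch short.
    """
    dim = len(video_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    infused = []
    for query in interaction_tokens:
        # Scaled dot-product score between this query and every video token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in video_tokens
        ]
        weights = softmax(scores)
        # Weighted sum of the video tokens gives the context vector.
        context = [
            sum(w * value[d] for w, value in zip(weights, video_tokens))
            for d in range(dim)
        ]
        # Residual connection keeps the original interaction information.
        infused.append([q + c for q, c in zip(query, context)])
    return infused


# Toy example: 2 interaction tokens and 3 video tokens of dimension 4.
interaction_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
video_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
infused = infuse_context(interaction_tokens, video_tokens)
```

In this toy setup each interaction token keeps its own content through the residual path and gains a weighted mixture of the video tokens, with the largest weight on the video token it is most similar to. A real model would use learned projections, multiple heads, and many more tokens.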
