Recognizing interactive actions plays an important role in human-robot interaction and collaboration. Previous methods use late fusion and co-attention mechanisms to capture interactive relations, but these have limited learning capability or adapt inefficiently to larger numbers of interacting entities. Because they assume that the priors of each entity are already known, they also lack evaluation in a more general setting that addresses the diversity of subjects. To address these problems, we propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations. Specifically, our network contains a tokenizer that partitions the input into Interactive Spatiotemporal Tokens (ISTs), a unified way to represent the motions of multiple diverse entities. By extending the entity dimension, ISTs provide better interactive representations. To learn jointly along the three dimensions of ISTs, we design multi-head self-attention blocks integrated with 3D convolutions to capture inter-token correlations. When modeling these correlations, a strict entity ordering is usually irrelevant for recognizing interactive actions. We therefore propose Entity Rearrangement to eliminate the orderliness in ISTs for interchangeable entities. Extensive experiments on four datasets verify the effectiveness of ISTA-Net, which outperforms state-of-the-art methods. Our code is publicly available at https://github.com/Necolizer/ISTA-Net
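The abstract does not say how Entity Rearrangement is implemented. One plausible reading is that the entity axis is randomly permuted during training, so the model cannot rely on which entity comes first. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `rearrange_entities` and the nested-list data layout (entity, then frame, then joint coordinate) are hypothetical choices made for illustration.

```python
import random
from typing import List, Optional, Sequence

# Assumed layout: one entity's motion is frames[t][j] = (x, y, z) for joint j.
Motion = List[List[Sequence[float]]]


def rearrange_entities(
    sample: List[Motion], rng: Optional[random.Random] = None
) -> List[Motion]:
    """Return a new list with the entity axis of `sample` randomly permuted.

    Assumption: the entities are interchangeable, so every ordering
    describes the same interaction. The input list is not modified.
    """
    if rng is None:
        rng = random.Random()
    order = list(range(len(sample)))
    rng.shuffle(order)  # in-place shuffle of the index list only
    return [sample[i] for i in order]
```

Calling `rearrange_entities([person_a, person_b])` returns the same two motion sequences in a random order. Entities that are not interchangeable would presumably be left in place; the abstract limits the technique to interchangeable entities.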