Action detection is a challenging video understanding task that requires modeling both spatio-temporal and interaction relations. Current methods usually model actor-actor and actor-context relations separately, ignoring their complementarity and mutual support. To solve this problem, we propose a novel network called the Multi-Relation Support Network (MRSN). In MRSN, the Actor-Context Relation Encoder (ACRE) and the Actor-Actor Relation Encoder (AARE) model the actor-context and actor-actor relations separately. Then the Relation Support Encoder (RSE) computes the supports between the two relations and performs relation-level interactions. Finally, the Relation Consensus Module (RCM) enhances the two relations with long-term relations from the Long-term Relation Bank (LRB) and yields a consensus. Our experiments demonstrate that modeling relations separately and then performing relation-level interactions can reach and surpass state-of-the-art results on two challenging video datasets: AVA and UCF101-24.
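The data flow described in the abstract (ACRE and AARE, then RSE, then RCM with the LRB) can be illustrated with a minimal sketch. The abstract does not specify the layers inside each module, so everything here is an assumption for illustration only: each module is stood in for by unlearned scaled dot-product attention with a residual connection, the consensus is taken as a plain average, and the names `attend`, `mrsn_sketch`, and all feature shapes are hypothetical rather than taken from the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, memory):
    # Hypothetical stand-in for a relation encoder: scaled dot-product
    # attention with a residual connection and no learned weights.
    scores = query @ memory.T / np.sqrt(query.shape[-1])
    return query + softmax(scores) @ memory

def mrsn_sketch(actors, context, long_term_bank):
    # ACRE stand-in (assumption): each actor attends to scene context features.
    actor_context = attend(actors, context)
    # AARE stand-in (assumption): each actor attends to all actors in the clip.
    actor_actor = attend(actors, actors)
    # RSE stand-in (assumption): each relation is refined by attending to the other.
    ac_supported = attend(actor_context, actor_actor)
    aa_supported = attend(actor_actor, actor_context)
    # RCM stand-in (assumption): enhance both relations with a long-term bank,
    # then average them to form a consensus feature per actor.
    ac_long = attend(ac_supported, long_term_bank)
    aa_long = attend(aa_supported, long_term_bank)
    return 0.5 * (ac_long + aa_long)

# Toy inputs: 3 actors, 49 context locations, 20 bank entries, 16-dim features.
rng = np.random.default_rng(0)
actors = rng.normal(size=(3, 16))
context = rng.normal(size=(49, 16))
bank = rng.normal(size=(20, 16))

consensus = mrsn_sketch(actors, context, bank)
print(consensus.shape)  # one consensus feature vector per actor
```

The sketch only shows the ordering the abstract describes: the two relations are computed separately, interact at the relation level, and are fused after long-term enhancement. A real model would use learned projections and a classification head, none of which are specified in the abstract.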