Modeling the relations between actors and scene context advances video action detection, where correlations among multiple actors make recognizing their actions challenging. Existing studies model the relation between each actor and the scene to improve action recognition. However, scene variations and background interference limit the effectiveness of this relation modeling. In this paper, we propose to select actor-related scene context, rather than directly leveraging the raw video scene, to improve relation modeling. We develop a Cycle Actor-Context Relation network (CycleACR), built on a symmetric graph that models actor and context relations bidirectionally. CycleACR consists of Actor-to-Context Reorganization (A2C-R), which collects actor features to reorganize context features, and Context-to-Actor Enhancement (C2A-E), which dynamically utilizes the reorganized context features to enhance actor features. Compared to existing designs that focus on C2A-E, CycleACR introduces A2C-R for more effective relation modeling. This modeling enables CycleACR to achieve state-of-the-art performance on two popular action detection datasets (AVA and UCF101-24). We also provide ablation studies and visualizations to show how our cycle actor-context relation modeling improves video action detection. Code is available at https://github.com/MCG-NJU/CycleACR.
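To make the two-stage data flow concrete, the following is a minimal sketch of a cycle between actor and context features. It is not the CycleACR implementation: the abstract does not specify the internals of A2C-R or C2A-E, so plain scaled dot-product attention with a residual connection and no learned projections is swapped in here for both stages. The names `attend` and `cycle_acr_sketch`, the feature dimension, and the random inputs are all hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys_values):
    """Scaled dot-product attention with a residual connection.

    Assumption: keys double as values and there are no learned
    projections. This is a stand-in, not the paper's operator.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys_values])
        mixed = [
            sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(d)
        ]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def cycle_acr_sketch(actors, context):
    """Hypothetical two-step cycle mirroring the order in the abstract."""
    # Stand-in for A2C-R: each context location gathers actor features,
    # yielding reorganized (actor-related) context features.
    reorganized = attend(context, actors)
    # Stand-in for C2A-E: each actor is enhanced from the reorganized context.
    enhanced = attend(actors, reorganized)
    return reorganized, enhanced


random.seed(0)
dim = 8  # hypothetical feature dimension
actors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
context = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

reorganized, enhanced = cycle_acr_sketch(actors, context)
print(len(reorganized), len(reorganized[0]))
print(len(enhanced), len(enhanced[0]))
```

The point of the sketch is only the ordering: context is first reshaped by the actors, and the actors are then updated from that reshaped context rather than from the raw scene features. Designs that use only the second step would call `attend(actors, context)` directly.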