Traditional video action detectors typically adopt a two-stage pipeline, where a person detector is first employed to generate actor boxes and 3D RoIAlign is then used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and cannot capture context information outside the bounding box. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, and thus suffer from inferior performance or slower convergence. In this paper, we propose a new one-stage sparse action detector, termed STMixer. STMixer is based on two core designs. First, we present a query-based adaptive feature sampling module, which gives STMixer the flexibility to mine a set of discriminative features from the entire spatiotemporal domain. Second, we devise a dual-branch feature mixing module, which allows STMixer to dynamically attend to and mix video features along the spatial and temporal dimensions, respectively, for better feature decoding. Coupling these two designs with a video backbone yields an efficient end-to-end action detector. Without bells and whistles, STMixer achieves state-of-the-art results on the AVA, UCF101-24, and JHMDB datasets.
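To illustrate what sampling features "from the entire spatiotemporal domain" could look like, the following is a minimal sketch, not the STMixer implementation. It assumes each query carries a reference point and a set of 2D offsets, and that the same points are sampled in every frame by bilinear interpolation. The function names `bilinear_sample` and `sample_for_query`, the nested-list feature layout, and the border clamping are all assumptions made for illustration; in a real detector the offsets would be predicted from the query rather than passed in by hand.

```python
from math import floor


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate an H x W x C feature map (nested lists) at (x, y).

    Coordinates are in pixel units. Points outside the map are clamped to its
    border (an assumption of this sketch, not something stated in the abstract).
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(floor(x)), int(floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional distances to the top-left neighbour
    channels = len(feat[0][0])
    return [
        (1 - ay) * ((1 - ax) * feat[y0][x0][c] + ax * feat[y0][x1][c])
        + ay * ((1 - ax) * feat[y1][x0][c] + ax * feat[y1][x1][c])
        for c in range(channels)
    ]


def sample_for_query(frames, center, offsets):
    """Sample every frame at center + offset for each (dx, dy) in offsets.

    `offsets` stands in for query-predicted offsets (hypothetical here).
    Returns features indexed as [frame][point][channel].
    """
    cx, cy = center
    return [
        [bilinear_sample(frame, cx + dx, cy + dy) for dx, dy in offsets]
        for frame in frames
    ]
```

Because the sampled points are not confined to a box, an offset can reach context anywhere in the feature map, which is the property that RoI-based feature extraction lacks.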