Spatiotemporal action localization in chaotic scenes is a challenging step toward advanced video understanding. High-quality video feature extraction and more precise detector-predicted anchors can effectively improve model performance. To this end, we propose SFMViT, a high-performance dual-stream spatiotemporal feature extraction network with an anchor pruning strategy. The backbone of SFMViT combines ViT and SlowFast with prior knowledge of spatiotemporal action localization, fully exploiting ViT's excellent global feature extraction capability and SlowFast's spatiotemporal sequence modeling capability. In addition, we introduce a confidence maximum heap that prunes the anchors detected in each frame, retaining only the effective ones. These designs enable SFMViT to achieve an mAP of 26.62% on the Chaotic World dataset, far exceeding existing models. Code is available at https://github.com/jfightyr/SlowFast-Meet-ViT.
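The confidence maximum-heap pruning described above can be sketched as follows. This is a minimal illustration only: the function name, the per-frame budget `k`, and the `(confidence, box)` tuple layout are assumptions, since the abstract does not specify the paper's exact data structures.

```python
import heapq

def prune_anchors(anchors, k):
    """Keep the k highest-confidence anchors detected in one frame.

    anchors: list of (confidence, box) tuples, where box is an opaque payload.
    heapq.nlargest builds a max-heap-like selection internally and returns
    the top-k entries sorted by descending confidence.
    """
    return heapq.nlargest(k, anchors, key=lambda a: a[0])

# Hypothetical per-frame detections: confidence scores with placeholder boxes.
frame_anchors = [(0.91, "box_a"), (0.15, "box_b"), (0.52, "box_c"), (0.48, "box_d")]
kept = prune_anchors(frame_anchors, k=2)  # → [(0.91, "box_a"), (0.52, "box_c")]
```

In practice such a heap-based top-k selection runs in O(n log k) per frame, which is why it is attractive for filtering dense anchor sets.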