Streaming video clips yield a large number of video tokens, which impedes efficient recognition with vision transformers (ViTs), especially in video action detection, where rich spatiotemporal representations are required to identify actors precisely. In this work, we propose an end-to-end framework for efficient video action detection (EVAD) based on vanilla ViTs. EVAD consists of two designs specialized for video action detection. First, we propose a spatiotemporal token dropout from a keyframe-centric perspective: in a video clip, we keep all tokens from the keyframe, preserve the tokens from other frames that are relevant to actor motions, and drop the remaining tokens. Second, we refine scene context by leveraging the tokens that remain after dropout to better recognize actor identities. The region of interest (RoI) in our action detector is expanded into the temporal domain, and the captured spatiotemporal actor identity representations are refined with scene context in a decoder with an attention mechanism. These two designs make EVAD efficient while maintaining accuracy, which is validated on three benchmark datasets (AVA, UCF101-24, and JHMDB). Compared to the vanilla ViT backbone, EVAD reduces the overall GFLOPs by 43% and improves real-time inference speed by 40% with no performance degradation. Moreover, even at similar computational cost, EVAD can improve performance by 1.1 mAP with higher-resolution inputs. Code is available at https://github.com/MCG-NJU/EVAD.
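The keyframe-centric token dropout described above can be illustrated with a minimal sketch. This is not the EVAD implementation (see the linked repository for that). It assumes a per-token importance score is already available for every frame. The abstract does not specify how relevance to actor motion is measured, so the `scores` input, the `keep_ratio` parameter, and the function name `keyframe_centric_keep_indices` are all hypothetical and used only for illustration.

```python
from typing import List, Sequence


def keyframe_centric_keep_indices(
    scores: Sequence[Sequence[float]],
    keyframe_index: int,
    keep_ratio: float,
) -> List[int]:
    """Return sorted flat indices (frame * tokens_per_frame + position) to keep.

    scores[t][n] is a hypothetical importance score for token n of frame t.
    keep_ratio is the assumed fraction of non-keyframe tokens to preserve.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")

    num_frames = len(scores)
    tokens_per_frame = len(scores[0])

    # 1) Every keyframe token is kept, regardless of its score.
    keyframe_tokens = [
        keyframe_index * tokens_per_frame + n for n in range(tokens_per_frame)
    ]

    # 2) Tokens from all other frames compete on their importance score.
    candidates = [
        (scores[t][n], t * tokens_per_frame + n)
        for t in range(num_frames)
        if t != keyframe_index
        for n in range(tokens_per_frame)
    ]
    budget = int(len(candidates) * keep_ratio)

    # Highest score first; ties are broken by the smaller flat index.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    preserved = [index for _, index in candidates[:budget]]

    # 3) Everything not returned here is dropped from the clip.
    return sorted(keyframe_tokens + preserved)


# Toy clip: 3 frames with 4 tokens each; frame 1 is the keyframe.
toy_scores = [
    [0.9, 0.1, 0.2, 0.8],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.7, 0.05, 0.6],
]
kept = keyframe_centric_keep_indices(toy_scores, keyframe_index=1, keep_ratio=0.5)
print(kept)
```

In this toy clip the keyframe contributes all four of its tokens even though its scores are zero, while the other two frames together keep only their four highest-scoring tokens. A real model would gather the kept tokens by these indices before passing them to later transformer layers.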