Vision Transformer (ViT) has shown high potential in video recognition, owing to its flexible design, adaptable self-attention mechanisms, and the efficacy of masked pre-training. Yet it remains unclear how to adapt these pre-trained short-term ViTs for temporal action detection (TAD) in untrimmed videos. Existing works treat them as off-the-shelf feature extractors for each short trimmed snippet, without capturing the fine-grained relations among different snippets in a broader temporal context. To mitigate this issue, this paper focuses on designing a new mechanism for adapting these pre-trained ViT models as a unified long-form video transformer that fully unleashes their modeling power in capturing inter-snippet relations, while keeping computation overhead and memory consumption low for efficient TAD. To this end, we design effective cross-snippet propagation modules that gradually exchange short-term video information among different snippets at two levels. For inner-backbone information propagation, we introduce a cross-snippet propagation strategy to enable multi-snippet temporal feature interaction inside the backbone. For post-backbone information propagation, we propose temporal transformer layers for further clip-level modeling. With the plain ViT-B pre-trained with VideoMAE, our end-to-end temporal action detector (ViT-TAD) achieves performance highly competitive with previous temporal action detectors, reaching up to 69.5 average mAP on THUMOS14, 37.40 average mAP on ActivityNet-1.3, and 17.20 average mAP on FineAction.
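The two levels of propagation described in the abstract can be illustrated with a toy sketch. This is not the ViT-TAD implementation: the function names (`intra_snippet_block`, `cross_snippet_block`, `post_backbone_temporal_layer`), the projection-free single-head attention, the mean pooling, and the choice to attend across snippets at each token position are all assumptions made only to show how information might flow. Snippet features are assumed to have shape `(N, T, C)`: N snippets, T tokens per snippet, C channels.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    # Single-head self-attention with identity projections (an assumption
    # for brevity). Attends over the second-to-last axis of x: (..., L, C).
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x


def intra_snippet_block(x):
    # x: (N, T, C). Each snippet attends only to its own T tokens,
    # as when the ViT processes every snippet independently.
    return x + self_attention(x)


def cross_snippet_block(x):
    # Hypothetical inner-backbone propagation: move the snippet axis into
    # the attention position so tokens at the same position exchange
    # information across the N snippets.
    xt = np.swapaxes(x, 0, 1)          # (T, N, C)
    out = xt + self_attention(xt)      # attention over N
    return np.swapaxes(out, 0, 1)      # back to (N, T, C)


def post_backbone_temporal_layer(x):
    # Hypothetical post-backbone propagation: pool each snippet to one
    # vector, then attend over the resulting length-N sequence.
    pooled = x.mean(axis=1)            # (N, C)
    return pooled + self_attention(pooled)


rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 6, 8))   # 4 snippets, 6 tokens, 8 channels
feats = intra_snippet_block(feats)       # short-term modeling per snippet
feats = cross_snippet_block(feats)       # exchange inside the "backbone"
clip = post_backbone_temporal_layer(feats)  # sequence of snippet features
```

In this sketch, perturbing one snippet leaves the output of `intra_snippet_block` for the other snippets unchanged, while `cross_snippet_block` lets the change reach every snippet. That difference is the inter-snippet relation that a per-snippet feature extractor cannot capture.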