Action recognition in videos is challenging because of its high computational cost, especially for Joint Space-Time video transformers (Joint VT). Despite their effectiveness, the excessive number of tokens in such architectures significantly limits their efficiency. In this paper, we propose HaltingVT, an efficient video transformer that adaptively removes redundant video patch tokens. It consists primarily of a Joint VT and a Glimpser module. Specifically, HaltingVT applies data-adaptive token reduction at each layer, which significantly reduces the overall computational cost. In addition, the Glimpser module quickly removes redundant tokens in shallow transformer layers; based on our observations, such tokens may even be misleading for video recognition tasks. To further encourage HaltingVT to focus on the key motion-related information in videos, we design an effective Motion Loss for training. HaltingVT acquires video analysis capabilities and token halting compression strategies simultaneously in a unified training process, without requiring additional training procedures or sub-networks. On the Mini-Kinetics dataset, HaltingVT achieves 75.0% top-1 accuracy with 24.2 GFLOPs, and 67.2% top-1 accuracy with an extremely low 9.9 GFLOPs. The code is available at https://github.com/dun-research/HaltingVT.
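To make the idea of per-layer, data-adaptive token halting more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a cumulative halting score per token that is compared against a threshold, which the abstract does not specify. The function names `halting_forward` and `relative_attention_cost`, the random linear map standing in for a transformer block, and the use of the first channel as a halting score are all hypothetical choices made for illustration. The Glimpser module and the Motion Loss are not modeled.

```python
import numpy as np


def halting_forward(tokens, num_layers=6, threshold=0.99, seed=0):
    """Toy forward pass: a token is dropped once its cumulative halting
    score reaches `threshold` (an assumed rule, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    x = np.array(tokens, dtype=float)      # copy so the caller's array is untouched
    n, d = x.shape
    active = np.ones(n, dtype=bool)        # mask of tokens still being processed
    cumulative = np.zeros(n)               # per-token accumulated halting score
    counts = []                            # number of active tokens entering each layer

    for _ in range(num_layers):
        counts.append(int(active.sum()))
        # Stand-in for a transformer block: random linear map + tanh,
        # applied only to tokens that are still active.
        w = rng.standard_normal((d, d)) / np.sqrt(d)
        x[active] = np.tanh(x[active] @ w)
        # Hypothetical halting score: sigmoid of each token's first channel.
        scores = 1.0 / (1.0 + np.exp(-x[active, 0]))
        cumulative[active] += scores
        # Halt tokens whose accumulated score has reached the threshold.
        active &= cumulative < threshold

    return x, counts


def relative_attention_cost(counts, num_tokens):
    """Attention cost relative to keeping all tokens in every layer,
    using the fact that self-attention scales quadratically with token count."""
    full = len(counts) * num_tokens ** 2
    return sum(c ** 2 for c in counts) / full


if __name__ == "__main__":
    video_tokens = np.random.default_rng(1).standard_normal((32, 8))
    _, per_layer = halting_forward(video_tokens)
    print("active tokens per layer:", per_layer)
    print("relative attention cost:", relative_attention_cost(per_layer, 32))
```

Because joint space-time attention is quadratic in the number of tokens, even a moderate drop in active tokens per layer reduces the attention cost noticeably in this toy setting. The GFLOPs figures reported in the abstract come from the authors' experiments and cannot be reproduced with this sketch.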