Training an effective video action recognition model poses significant computational challenges, particularly under limited resource budgets. Current methods primarily aim either to reduce model size or to utilize pre-trained models, which limits their adaptability to various backbone architectures. This paper investigates the issue of over-sampled frames, a problem that is prevalent in many approaches yet has received relatively little attention. Although using fewer frames is a potential solution, it often results in a substantial decline in performance. To address this issue, we propose a novel method to restore the intermediate features for two sparsely sampled, adjacent video frames. This feature restoration technique adds negligible computational cost compared with resource-intensive image encoders such as ViT. To evaluate the effectiveness of our method, we conduct extensive experiments on four public datasets: Kinetics-400, ActivityNet, UCF-101, and HMDB-51. With our method integrated, the efficiency of three commonly used baselines improves by over 50%, with a mere 0.5% reduction in recognition accuracy. Surprisingly, our method also helps improve the generalization ability of the models under zero-shot settings.