Existing video recognition methods typically train a separate model for each input frame count, which requires repeated training runs and multiplies storage costs. When a model is evaluated at a frame count that was not used in training, its performance drops significantly (see Fig. 1); we term this the Temporal Frequency Deviation phenomenon. To address this issue, we propose a general framework, Frame Flexible Network (FFN), which not only lets a single model be evaluated at different frame counts to adjust its computation, but also significantly reduces the memory cost of storing multiple models. Concretely, FFN integrates several sets of training sequences, applies Multi-Frequency Alignment (MFAL) to learn temporal-frequency-invariant representations, and leverages Multi-Frequency Adaptation (MFAD) to further strengthen its representational capacity. Extensive experiments with various architectures on popular benchmarks demonstrate the effectiveness and generalization of FFN (e.g., gains of 7.08/5.15/2.17% at 4/8/16 frames on the Something-Something V1 dataset over Uniformer). Code is available at https://github.com/BeSpontaneous/FFN.
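To make the notion of inputs with different frame numbers concrete, the following is a minimal sketch of how one video could yield several index sequences at 4, 8, and 16 frames. It assumes that the "several sets of training sequences" are clips drawn from the same video at different frame counts, and it uses segment-centre sampling; both are assumptions for illustration, and the function names `uniform_frame_indices` and `multi_frequency_indices` are hypothetical rather than taken from the FFN code base. How the resulting sequences are aligned (MFAL) or adapted (MFAD) is not specified here and is not sketched.

```python
def uniform_frame_indices(num_video_frames: int, num_samples: int) -> list[int]:
    """Return the centre-frame index of each of `num_samples` equal segments.

    Assumption: segment-centre sampling; the actual sampling strategy
    used by FFN may differ.
    """
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("frame counts must be positive")
    segment_length = num_video_frames / num_samples
    # Clamp to the last valid index so short videos never index out of range.
    return [
        min(num_video_frames - 1, int(segment_length * (i + 0.5)))
        for i in range(num_samples)
    ]


def multi_frequency_indices(
    num_video_frames: int, frame_counts: tuple[int, ...] = (4, 8, 16)
) -> dict[int, list[int]]:
    """Build one index sequence per frame count for the same video.

    The default frame counts mirror the 4/8/16 settings reported above.
    """
    return {t: uniform_frame_indices(num_video_frames, t) for t in frame_counts}


if __name__ == "__main__":
    for t, idx in multi_frequency_indices(64).items():
        print(t, idx)
```

Under this sketch, a lower frame count samples the video more sparsely in time, which is one way to read why a model trained at one frame count sees a different input distribution when evaluated at another.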