Orthogonal Temporal Interpolation for Zero-Shot Video Recognition

Zero-shot video recognition (ZSVR) aims to recognize video categories that were not seen during model training. Recently, vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR. To make VLMs applicable to the video domain, existing methods often add a temporal learning module after the image-level encoder to learn the temporal relationships among video frames. Unfortunately, for videos from unseen categories, we observe an abnormal phenomenon: the model that uses spatial-temporal features performs much worse than the model that removes the temporal learning module and uses only spatial features. We conjecture that improper temporal modeling disrupts the spatial features of the video. To verify this hypothesis, we propose Feature Factorization to retain the orthogonal temporal feature of the video, and we use interpolation to construct a refined spatial-temporal feature. The model using the appropriately refined spatial-temporal feature performs better than the one using only the spatial feature, which verifies the effectiveness of the orthogonal temporal feature for the ZSVR task. We therefore design an Orthogonal Temporal Interpolation module to learn a better refined spatial-temporal video feature during training. Additionally, we introduce a Matching Loss to improve the quality of the orthogonal temporal feature. Building on VLMs, we propose a model called OTI for ZSVR that employs orthogonal temporal interpolation and the matching loss. ZSVR accuracies on popular video datasets (i.e., Kinetics-600, UCF101 and HMDB51) show that OTI outperforms the previous state-of-the-art method by a clear margin.
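
The abstract does not give formulas for Feature Factorization or the interpolation step, so the following is only an illustrative sketch of one plausible reading. It assumes that the orthogonal temporal feature is the part of the spatial-temporal feature that is orthogonal to the spatial feature (a single Gram-Schmidt step), and that the refined feature is the spatial feature plus a weighted share of that orthogonal part. The function names, the weight `lam`, and the toy vectors are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the projection-based factorization and the
# additive interpolation below are assumptions, not the paper's definitions.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthogonal_temporal(spatial, spatio_temporal):
    """Remove from spatio_temporal its projection onto spatial.

    The result is orthogonal to the spatial feature (one Gram-Schmidt step).
    """
    scale = dot(spatio_temporal, spatial) / dot(spatial, spatial)
    return [st - scale * s for st, s in zip(spatio_temporal, spatial)]

def interpolate(spatial, temporal_orth, lam):
    """Refined feature: spatial feature plus lam times the orthogonal part."""
    return [s + lam * t for s, t in zip(spatial, temporal_orth)]

# Toy 3-dimensional features (hypothetical values).
spatial = [1.0, 0.0, 0.0]
spatio_temporal = [0.6, 0.8, 0.0]

t_orth = orthogonal_temporal(spatial, spatio_temporal)
refined = interpolate(spatial, t_orth, lam=0.5)
```

Under these assumptions, `lam = 0` recovers the spatial-only feature, and larger values mix in more temporal information without changing the refined feature's component along the spatial direction.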
