In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from the strong generalization capability it gains from pretraining on massive image-text pairs. However, applying CLIP directly to open-vocabulary action recognition is challenging because CLIP's pretraining involves no temporal information. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, yielding unsatisfactory results on unseen actions. To address these issues, FROSTER employs a residual feature distillation approach that ensures CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability of the original CLIP, and supervises feature learning to extract video-specific features that bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to balance the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets. Project page: https://visual-ai.github.io/froster.
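The abstract does not spell out the form of the residual sub-network or the distillation loss, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the sub-network is a two-layer MLP with an identity shortcut, scaled by a hypothetical coefficient `alpha`, and that the loss is a mean squared error against the frozen teacher's features. All names (`residual_projection`, `distillation_loss`, `w1`, `w2`, `alpha`) and the random stand-in features are illustrative assumptions.

```python
import numpy as np


def residual_projection(student_feat, w1, w2, alpha=0.1):
    """Identity shortcut plus a scaled two-layer MLP correction (assumed form)."""
    hidden = np.maximum(student_feat @ w1, 0.0)  # ReLU activation
    return student_feat + alpha * (hidden @ w2)


def distillation_loss(student_feat, teacher_feat, w1, w2, alpha=0.1):
    """Mean squared error between projected student features and teacher features."""
    projected = residual_projection(student_feat, w1, w2, alpha)
    return float(np.mean((projected - teacher_feat) ** 2))


rng = np.random.default_rng(0)
dim = 8

# Random stand-ins: real features would come from the frozen CLIP encoder
# (teacher) and the fine-tuned video model (student).
teacher_feat = rng.normal(size=(4, dim))
student_feat = teacher_feat + 0.5 * rng.normal(size=(4, dim))

w1 = rng.normal(scale=0.1, size=(dim, dim))
w2 = np.zeros((dim, dim))  # assumed zero init: the residual branch starts as the identity

loss = distillation_loss(student_feat, teacher_feat, w1, w2)
print(f"distillation loss: {loss:.4f}")
```

Under these assumptions, the shortcut means the student features are not forced to equal the teacher's: the MLP branch can absorb part of the difference, which leaves the student room to keep video-specific components while the loss still pulls it toward the teacher's generalizable features.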