Large-scale contrastive language-image pre-training, as exemplified by CLIP, has recently shown remarkable success across a wide range of downstream tasks, but it remains under-explored for the challenging few-shot action recognition (FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge of CLIP to alleviate inaccurate prototype estimation caused by data scarcity, a critical problem in low-shot regimes. To this end, we present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of two key components: a video-text contrastive objective and prototype modulation. The former bridges the task discrepancy between CLIP and the few-shot video task by contrasting videos with their corresponding class text descriptions. The latter leverages the transferable textual concepts from CLIP to adaptively refine visual prototypes with a temporal Transformer. In this way, CLIP-FSAR can take full advantage of the rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate few-shot classification. Extensive experiments on five commonly used benchmarks demonstrate the effectiveness of the proposed method, and CLIP-FSAR significantly outperforms existing state-of-the-art methods under various settings. The source code and models will be publicly available at https://github.com/alibaba-mmai-research/CLIP-FSAR.
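To make the two components more concrete, the following is a minimal, illustrative sketch and not the released CLIP-FSAR implementation. It assumes frame and text features have already been extracted by CLIP encoders, uses mean pooling over frames with cosine similarity as the matching metric, and replaces the temporal Transformer with a single parameter-free self-attention layer. All function names, the temperature value, and the toy features are hypothetical.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (falls back to 1.0 for a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    # Average per-frame features into one video-level feature.
    return [sum(col) / len(frames) for col in zip(*frames)]


def video_text_contrastive_loss(frames, text_feats, label, temperature=0.07):
    # Cross-entropy over video-to-text cosine similarities.
    # The temperature of 0.07 is an assumed value for illustration.
    video = mean_pool(frames)
    logits = [cosine(video, t) / temperature for t in text_feats]
    return -math.log(softmax(logits)[label])


def modulate_prototype(support_frames, text_feat):
    # Stand-in for the temporal Transformer: one self-attention layer with
    # a residual connection over the sequence [text token, frame tokens].
    tokens = [text_feat] + support_frames
    scale = math.sqrt(len(text_feat))
    outputs = []
    for q in tokens:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        )
        mixed = [
            sum(w * v[d] for w, v in zip(weights, tokens))
            for d in range(len(q))
        ]
        outputs.append([a + b for a, b in zip(q, mixed)])
    return outputs[1:]  # drop the text token, keep the modulated frames


def classify(query_frames, support_by_class, text_feats):
    # Nearest-prototype classification against text-modulated prototypes.
    query = mean_pool(query_frames)
    scores = []
    for frames, text in zip(support_by_class, text_feats):
        prototype = mean_pool(modulate_prototype(frames, text))
        scores.append(cosine(query, prototype))
    return scores.index(max(scores))
```

In this sketch the text token takes part in attention, so each support frame is pulled toward its class description before pooling. That is the intuition behind refining prototypes with textual concepts when only a few support videos are available.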