Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization in the "zero-shot" setting and has been applied to many downstream tasks. We explore adapting CLIP to obtain a more efficient and more general action recognition method. We argue that the key lies in explicitly modeling the motion cues that flow across video frames. To that end, we design a two-stream motion modeling block that captures motion and spatial information simultaneously. The resulting motion cues then drive a dynamic prompts learner that generates motion-aware prompts carrying rich semantic information about human actions. In addition, we propose a multimodal communication block that enables collaborative learning between the modalities and further improves performance. We conduct extensive experiments on the HMDB-51, UCF-101, and Kinetics-400 datasets. Our method outperforms most existing state-of-the-art methods by a significant margin in the "few-shot" and "zero-shot" settings. It also achieves competitive performance in the "closed-set" setting with very few trainable parameters and little additional computational cost.
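The data flow described above (spatial features plus motion cues, with the motion cues also conditioning the text prompts before CLIP-style similarity matching) can be illustrated with a toy sketch. This is not the proposed architecture: the abstract gives no layer details, so frame differencing as the motion cue, the additive prompt shift, the `scale` parameter, and all function names here are assumptions made purely for illustration, and the hand-written vectors stand in for features that a real system would obtain from CLIP's encoders.

```python
import math


def frame_differences(frames):
    """Differences between consecutive per-frame feature vectors.

    Assumption: a simple stand-in for motion cues, not the paper's block.
    """
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def motion_aware_prompts(class_embeddings, motion_cue, scale=0.1):
    """Shift each class text embedding by the scaled motion cue.

    Hypothetical placeholder for a learned dynamic prompts learner.
    """
    return {name: [e + scale * m for e, m in zip(emb, motion_cue)]
            for name, emb in class_embeddings.items()}


def classify(frames, class_embeddings, scale=0.1):
    """Score a clip against class prompts and return (best_label, scores)."""
    spatial = mean_pool(frames)                # spatial stream
    diffs = frame_differences(frames)          # motion stream
    motion = mean_pool(diffs) if diffs else [0.0] * len(spatial)
    video = [s + m for s, m in zip(spatial, motion)]
    prompts = motion_aware_prompts(class_embeddings, motion, scale)
    scores = {name: cosine(video, p) for name, p in prompts.items()}
    return max(scores, key=scores.get), scores


# Toy per-frame features and class text embeddings (made-up values).
frames = [[1.0, 0.0, 0.0], [1.0, 0.5, 0.0], [1.0, 1.0, 0.0]]
classes = {"wave": [1.0, 1.0, 0.0], "sit": [0.0, 0.0, 1.0]}
label, scores = classify(frames, classes)
```

In this sketch the same motion cue shifts every class prompt, so it only shows where motion information enters the pipeline; a learned prompt generator would presumably produce richer, trainable prompt tokens.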