Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-based visual-language models, such as CLIP, to videos, extending their zero-shot capabilities to the video domain. While these adaptations have shown promising results, they come at a significant computational cost and struggle to model the temporal aspects that are crucial in the video domain. In this study, we present EZ-CLIP, a simple and efficient adaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal visual prompting for temporal adaptation, requiring no fundamental alterations to the core CLIP architecture while preserving its generalization abilities. Moreover, we introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion, thereby enhancing the model's ability to learn from video data. We conducted extensive experiments on five benchmark datasets, evaluating EZ-CLIP for zero-shot learning and base-to-novel video action recognition, and also demonstrating its potential for few-shot generalization. With only 5.2 million learnable parameters (as opposed to the 71.1 million in the prior best model), EZ-CLIP can be trained efficiently on a single GPU and outperforms existing approaches in several evaluations.