Recent advances in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization on zero-shot tasks. Building on this success, efforts have been made to adapt image-based visual-language models such as CLIP to videos, extending their zero-shot capabilities to the video domain. While these adaptations have shown promising results, they come at a significant computational cost and struggle to model the temporal aspects that are crucial to video. In this study, we present EZ-CLIP, a simple and efficient adaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal visual prompting for seamless temporal adaptation, requiring no fundamental alterations to the core CLIP architecture while preserving its remarkable generalization abilities. Moreover, we introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion, thereby enhancing the model's ability to learn from video data. We conducted extensive experiments on five different benchmark datasets, thoroughly evaluating EZ-CLIP for zero-shot learning and base-to-novel video action recognition, and also demonstrating its potential for few-shot generalization. With only 5.2 million learnable parameters (as opposed to 71.1 million in the prior best model), EZ-CLIP can be trained efficiently on a single GPU while outperforming existing approaches in several evaluations.