EZ-CLIP: Efficient Zeroshot Video Action Recognition

Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-based visual-language models, such as CLIP, to videos, extending their zero-shot capabilities to the video domain. While these adaptations have shown promising results, they come at a significant computational cost and struggle to model the temporal aspects that are crucial in the video domain. In this study, we present EZ-CLIP, a simple and efficient adaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal visual prompting for seamless temporal adaptation, requiring no fundamental alterations to the core CLIP architecture while preserving its remarkable generalization abilities. Moreover, we introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion, thereby enhancing the model's ability to learn from video data. We conducted extensive experiments on five different benchmark datasets, thoroughly evaluating EZ-CLIP for zero-shot learning and base-to-novel video action recognition, and also demonstrating its potential for few-shot generalization. With only 5.2 million learnable parameters (as opposed to the 71.1 million in the prior best model), EZ-CLIP can be efficiently trained on a single GPU, outperforming existing approaches in several evaluations.
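
To make the zero-shot setting concrete, the following is a minimal sketch of generic CLIP-style zero-shot video classification, not EZ-CLIP's own method. It assumes per-frame image embeddings and per-class text embeddings are already available (the toy vectors, the function names, and the class labels are all hypothetical), pools the frames by simple averaging, and picks the class whose text embedding has the highest cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average frame embeddings into one video embedding (ignores frame order)."""
    n = len(frame_embeddings)
    return [sum(column) / n for column in zip(*frame_embeddings)]


def zero_shot_classify(frame_embeddings, class_text_embeddings):
    """Return the best-matching label and the cosine score for every label."""
    video = l2_normalize(mean_pool([l2_normalize(f) for f in frame_embeddings]))
    scores = {
        label: sum(v * t for v, t in zip(video, l2_normalize(text)))
        for label, text in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings standing in for encoder outputs.
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
class_texts = {"archery": [1.0, 0.1, 0.0], "bowling": [0.0, 1.0, 0.2]}

predicted, scores = zero_shot_classify(frames, class_texts)
print(predicted, scores)
```

Because mean pooling discards the order of frames, this baseline cannot distinguish actions that differ only in motion, which is the kind of temporal information the abstract says the temporal visual prompts are meant to capture.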

