CLIP has demonstrated strong generalization in visual domains through natural language supervision, including video action recognition. However, most existing approaches that adapt CLIP for action recognition focus primarily on temporal modeling and often overlook spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor to effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model's attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency is independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at https://github.com/AI-CV-Lab/EV-CLIP.
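The two prompt types described above can be illustrated with a minimal NumPy sketch. This is not the EV-CLIP implementation: the abstract does not specify how the mask is produced or how frame features are compressed, so the sketch assumes a single learnable logit map shared across frames for the mask prompt, and mean pooling followed by a linear projection for the context prompt. The names `apply_mask_prompt`, `context_prompt`, `mask_logits`, and `proj` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping logits to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def apply_mask_prompt(frames, mask_logits):
    """Reweight pixels with a soft mask (assumed form of a mask prompt).

    frames:      array of shape (T, H, W, C)
    mask_logits: array of shape (H, W); hypothetical learnable parameters,
                 assumed here to be shared by all T frames
    """
    weights = sigmoid(mask_logits)[None, :, :, None]  # broadcast to (1, H, W, 1)
    return frames * weights


def context_prompt(frame_features, proj):
    """Compress frame-wise features into one vector (assumed form).

    frame_features: array of shape (T, D)
    proj:           array of shape (D, D); hypothetical learnable projection
    """
    pooled = frame_features.mean(axis=0)  # (D,), average over the T frames
    return pooled @ proj


rng = np.random.default_rng(0)

# Toy clip: 8 frames of 4x4 RGB pixels.
frames = rng.random((8, 4, 4, 3))

# Emphasize the central 2x2 region; zero logits elsewhere give weight 0.5.
mask_logits = np.zeros((4, 4))
mask_logits[1:3, 1:3] = 4.0
masked = apply_mask_prompt(frames, mask_logits)

# Toy frame-wise features standing in for per-frame encoder outputs.
features = rng.standard_normal((8, 16))
proj = np.eye(16)  # identity projection keeps the example easy to check
ctx = context_prompt(features, proj)

print(masked.shape, ctx.shape)
```

In this sketch the only trainable quantities would be `mask_logits` and `proj`, and neither has a shape that depends on the encoder's depth or width beyond the feature dimension. That is one way a prompt-based adapter could keep its cost roughly decoupled from backbone scale, though the paper's actual mechanism may differ.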