Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models such as Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MVs) provide highly efficient temporal information directly from compressed video streams. To leverage the complementary strengths of these two paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late-fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MVs. During fusion, both backbones remain frozen and only a tiny Multi-Layer Perceptron (MLP) head is trained, ensuring extreme efficiency. In comprehensive experiments on the UCF101 dataset, our method achieves 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, highly efficient baseline for video understanding that effectively bridges the gap between large static models and dynamic, low-cost motion cues. Our code and models are available at https://github.com/microa/MoCLIP-Lite.
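As a rough illustration of the late-fusion design described above, the following is a minimal PyTorch sketch of a trainable MLP head that fuses a frozen CLIP image embedding with a pooled motion-vector feature. The class name, feature dimensions, hidden size, and dropout rate are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Hypothetical fusion head: concatenates frozen RGB (CLIP) and
    motion-vector features, then classifies with a small trainable MLP."""
    def __init__(self, clip_dim=512, mv_dim=512, hidden_dim=512, num_classes=101):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + mv_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, clip_feat, mv_feat):
        # Both inputs come from frozen backbones; only this head receives gradients.
        return self.mlp(torch.cat([clip_feat, mv_feat], dim=-1))


if __name__ == "__main__":
    # Dummy tensors standing in for the frozen CLIP image encoder output
    # and the lightweight motion-vector network output (dimensions assumed).
    clip_feat = torch.randn(8, 512)
    mv_feat = torch.randn(8, 512)
    head = LateFusionHead(num_classes=101)  # UCF101 has 101 action classes
    logits = head(clip_feat, mv_feat)
    print(logits.shape)  # torch.Size([8, 101])
```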