Recently, the rise of large-scale vision-language pretrained models such as CLIP, coupled with Parameter-Efficient Fine-Tuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named \name to address these challenges, preserving both high supervised performance and robust transferability. First, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Second, we design a multi-task decoder with a rich set of supervisory signals to satisfy the need for both strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
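To make the two temporal cues named in the abstract concrete, the following is a minimal sketch of how a global temporal signal and a local frame-to-frame difference could be combined in a residual update. It is an illustration only and not the TED-Adapter itself: the function names, the use of a per-channel mean as the global cue, the zero difference for the first frame, and the `scale` parameter are all assumptions, and the learned projections that an actual adapter would contain are omitted.

```python
from statistics import fmean


def temporal_difference(frames):
    """Local cue (assumed form): each frame feature minus the previous one.

    The first frame has no predecessor, so its difference is set to zeros.
    """
    diffs = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for c, p in zip(cur, prev)])
    return diffs


def temporal_mean(frames):
    """Global cue (assumed form): per-channel mean over all frames."""
    return [fmean(channel) for channel in zip(*frames)]


def ted_like_update(frames, scale=0.1):
    """Hypothetical residual update: x_t + scale * (global_mean + diff_t).

    No learned weights are used; a real adapter would project these cues
    through trainable layers before adding them back to the features.
    """
    global_ctx = temporal_mean(frames)
    diffs = temporal_difference(frames)
    return [
        [x + scale * (g + d) for x, g, d in zip(frame, global_ctx, diff)]
        for frame, diff in zip(frames, diffs)
    ]


# Three frames with two feature channels each.
frames = [[1.0, 0.0], [2.0, 0.0], [4.0, 2.0]]
updated = ted_like_update(frames, scale=0.5)
```

Because the update is residual, setting `scale=0.0` returns the input features unchanged, which mirrors the usual motivation for adapters: the pretrained representation is preserved and only a small correction is added on top.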