Applying large-scale pre-trained visual models such as CLIP to few-shot action recognition can improve both performance and efficiency. The "pre-training, fine-tuning" paradigm avoids training a network from scratch, which is time-consuming and resource-intensive. However, this approach has two drawbacks. First, the limited labeled samples available in few-shot action recognition require minimizing the number of tunable parameters to mitigate over-fitting, yet doing so can leave the model inadequately fine-tuned, which increases resource consumption and may disrupt the model's generalized representation. Second, the extra temporal dimension of video makes effective temporal modeling challenging for few-shot recognition, while pre-trained visual models are usually image models. This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues. It adapts CLIP for few-shot action recognition by adding lightweight adapters, which minimize the number of learnable parameters and enable the model to transfer quickly across different tasks. The adapters we design combine information from video-text multimodal sources for task-oriented spatiotemporal modeling, which is fast, efficient, and has low training cost. Additionally, based on the attention mechanism, we design a text-guided prototype construction module that fully utilizes video-text information to enhance the representation of video prototypes. Our MA-CLIP is plug-and-play and can be used with any temporal alignment metric for few-shot action recognition.
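To make the idea of a lightweight adapter concrete, the following is a minimal sketch of a generic residual bottleneck adapter in NumPy. It is not the MA-CLIP adapter itself: the class name `BottleneckAdapter`, the bottleneck width of 64, the zero-initialized up-projection, and the token shape are all illustrative assumptions, and the multimodal (video-text) fusion described above is omitted.

```python
import numpy as np


class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter: x + up(gelu(down(x)))."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection to a narrow bottleneck keeps the parameter count small.
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Assumption: the up-projection starts at zero, so the adapter is an
        # identity map at initialization and the frozen backbone is undisturbed.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def num_params(self):
        """Number of learnable values held by this adapter."""
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))

    def __call__(self, x):
        h = x @ self.w_down + self.b_down
        # tanh approximation of GELU
        h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
        # Residual connection around the bottleneck.
        return x + h @ self.w_up + self.b_up


# Illustrative shapes: 8 frames, 197 tokens per frame, 768-dimensional features.
dim, bottleneck = 768, 64
adapter = BottleneckAdapter(dim, bottleneck)
tokens = np.random.default_rng(1).normal(size=(8, 197, dim))
out = adapter(tokens)
```

In a setup like this, only the adapter's parameters would be trained while the pre-trained backbone weights stay frozen, which is how the number of learnable parameters stays small relative to full fine-tuning.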