Transferring vision-language knowledge from pretrained multimodal foundation models to downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality because annotating additional textual descriptions is costly. In this paper, we develop an effective plug-and-play framework called CapFSAR that exploits the knowledge of multimodal models without manually annotated text. Specifically, we first use a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate captions for the input videos. We then apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, we design a Transformer-based visual-text aggregation module that incorporates complementary cross-modal spatio-temporal information for reliable few-shot matching. In this way, CapFSAR benefits from the powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.
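The data flow described above (visual features plus a caption embedding, fused by attention, then matched against a few labeled examples) can be illustrated with a minimal sketch. This is not the CapFSAR implementation: the BLIP feature extractor and text encoder are replaced by hypothetical toy vectors, the aggregation step is a single parameter-free self-attention pass rather than a trained Transformer, and the matching step is plain nearest-prototype cosine similarity, chosen here only for brevity. All function names are assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vectors(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)


def aggregate(frame_feats, text_feat):
    """Fuse one caption embedding with per-frame features.

    Assumption: one unparameterized self-attention pass over
    [text token, frame tokens], followed by mean pooling. A trained
    Transformer would use learned projections and several layers.
    """
    tokens = [text_feat] + frame_feats
    scale = math.sqrt(len(text_feat))
    fused = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        fused.append(
            [sum(w * k[d] for w, k in zip(weights, tokens)) for d in range(len(q))]
        )
    return mean_vectors(fused)


def classify(query_embedding, support_embeddings):
    """Return the label whose class prototype is closest in cosine similarity.

    Assumption: prototype matching stands in for whatever few-shot
    matcher the framework is plugged into.
    """
    prototypes = {
        label: mean_vectors(embs) for label, embs in support_embeddings.items()
    }
    return max(prototypes, key=lambda label: cosine(query_embedding, prototypes[label]))


def toy_video(center, rng, num_frames=4):
    """Hypothetical stand-in for the captioning model and text encoder:
    returns noisy frame features and a noisy caption embedding."""
    def noisy():
        return [c + rng.uniform(-0.05, 0.05) for c in center]
    return [noisy() for _ in range(num_frames)], noisy()


# 2-way 1-shot episode on synthetic data
rng = random.Random(0)
centers = {"jump": [1.0, 0.0, 0.0, 0.0], "wave": [0.0, 1.0, 0.0, 0.0]}
support = {label: [aggregate(*toy_video(c, rng))] for label, c in centers.items()}
query = aggregate(*toy_video(centers["wave"], rng))
prediction = classify(query, support)
```

In a real setting, `toy_video` would be replaced by features and captions produced by the pretrained captioning model, and `aggregate` by the trained aggregation module; the episode structure (support set, query, label prediction) stays the same.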