Few-shot Action Recognition with Captioning Foundation Models

Transferring vision-language knowledge from pretrained multimodal foundation models to various downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality input due to the high cost of annotating additional textual descriptions. In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text. To be specific, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate associated captions for input videos. Then, we apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, a visual-text aggregation module based on Transformer is further designed to incorporate cross-modal spatio-temporal complementary information for reliable few-shot matching. In this way, CapFSAR can benefit from powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.
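
The pipeline described in the abstract (visual features, caption-derived text embeddings, cross-modal aggregation, few-shot matching) can be illustrated with a minimal sketch. This is not the CapFSAR implementation. It assumes the visual and text embeddings have already been computed elsewhere as one plain vector per video, it replaces the Transformer-based aggregation module with simple normalize-and-concatenate fusion, and it uses nearest-prototype matching with cosine similarity as a stand-in for the paper's matching procedure. All function names (`fuse`, `class_prototypes`, `classify`) are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, text):
    """Assumed stand-in for the visual-text aggregation module:
    normalize each modality, then concatenate."""
    return l2_normalize(visual) + l2_normalize(text)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def class_prototypes(support):
    """Average the fused features of the support videos for each class.

    support: list of (label, visual_embedding, text_embedding) tuples.
    """
    grouped = defaultdict(list)
    for label, visual, text in support:
        grouped[label].append(fuse(visual, text))
    return {
        label: [sum(column) / len(features) for column in zip(*features)]
        for label, features in grouped.items()
    }


def classify(query_visual, query_text, prototypes):
    """Return the label whose prototype is most similar to the fused query."""
    query = fuse(query_visual, query_text)
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy 2-way 2-shot episode with made-up 2-dimensional embeddings.
support = [
    ("jump", [1.0, 0.0], [0.9, 0.1]),
    ("jump", [0.9, 0.1], [1.0, 0.0]),
    ("run", [0.0, 1.0], [0.1, 0.9]),
    ("run", [0.1, 0.9], [0.0, 1.0]),
]
prototypes = class_prototypes(support)
print(classify([0.8, 0.2], [0.7, 0.3], prototypes))
```

In a real system the embeddings would come from the captioning model's visual encoder and from a text encoder applied to the generated captions, and the fusion step would be learned rather than fixed.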

Related content

Few-shot learning

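Few-shot learning (FSL) addresses how to improve a neural network's performance when only a small amount of data is available. One family of methods commonly used in FSL is meta-learning. Like ordinary neural network training, meta-learning has a training phase and a testing phase; these are called meta-training and meta-testing, respectively.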
