Few-shot (FS) and zero-shot (ZS) learning are two different approaches to scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples and instead exploits a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be regarded as a marriage of FS-TAD and ZS-TAD that leverages few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned, adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that MUPPET can be easily extended to tackle the few-shot object detection problem, where it again achieves state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip/MUPPET
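The idea of a multi-modal prompt, in which support videos are mapped into the textual token space and combined with the class name, can be illustrated with a toy sketch. This is not the MUPPET implementation: the function names (`mean_pool`, `linear`, `build_prompt`), the mean pooling over support videos, the purely linear tokenizer, the random weights, and all dimensions are assumptions made for illustration only. In the actual method the tokenizer is meta-learned and adapter-equipped.

```python
import random

def mean_pool(vectors):
    """Average a list of equal-length feature vectors into one prototype."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def linear(weight, x):
    """Apply a linear map given as a list of rows (output_dim x input_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def build_prompt(support_feats, name_token_embs, tokenizer_weights):
    """Hypothetical multi-modal prompt: visual pseudo-tokens + class-name tokens.

    support_feats:     one visual feature vector per support video (dim D_V)
    name_token_embs:   text-token embeddings of the class name (dim D_T)
    tokenizer_weights: K linear maps (each D_T x D_V), standing in for a
                       learned visual semantics tokenizer (assumption)
    """
    prototype = mean_pool(support_feats)
    # Project the visual prototype into K tokens living in the text space.
    visual_tokens = [linear(w, prototype) for w in tokenizer_weights]
    # Concatenate along the token axis: [visual tokens ; class-name tokens].
    return visual_tokens + name_token_embs

random.seed(0)
D_V, D_T, K = 8, 4, 2  # toy visual dim, text-token dim, number of visual tokens
support = [[random.random() for _ in range(D_V)] for _ in range(5)]      # 5-shot
name_tokens = [[random.random() for _ in range(D_T)] for _ in range(3)]  # 3 tokens
weights = [[[random.gauss(0, 0.1) for _ in range(D_V)] for _ in range(D_T)]
           for _ in range(K)]

prompt = build_prompt(support, name_tokens, weights)
print(len(prompt), len(prompt[0]))  # prints "5 4": K + 3 tokens, each of dim D_T
```

The resulting token sequence would then be fed to the text encoder of a vision-language model in place of a text-only prompt; that step is omitted here because it depends on the specific model.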