Current methods for few-shot action recognition mainly follow the metric learning framework of ProtoNet, which demonstrates the importance of prototypes. Although these methods achieve relatively good performance, they ignore multimodal information such as label texts. In this work, we propose a novel MultimOdal PRototype-ENhanced Network (MORN), which uses the semantic information of label texts as multimodal information to enhance prototypes. A CLIP visual encoder and a frozen CLIP text encoder are introduced to obtain features with good multimodal initialization. In the visual flow, visual prototypes are computed by a module such as the Temporal-Relational CrossTransformer (TRX). In the text flow, a semantic-enhanced (SE) module and an inflating operation are used to obtain text prototypes. The final multimodal prototypes are then computed by a multimodal prototype-enhanced (MPE) module. In addition, we define PRototype SImilarity DiffErence (PRIDE) to evaluate the quality of prototypes, and use it to verify both the improvement at the prototype level and the effectiveness of MORN. We conduct extensive experiments on four popular datasets, and MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. Plugging PRIDE into the training stage further improves performance.
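To make the prototype idea concrete, the following is a minimal sketch of ProtoNet-style classification in which each class prototype is blended with a text-derived vector. It is not the MPE module of MORN, whose internals are not described here. The fusion rule (a convex combination with weight `alpha`), the function names, and the toy two-dimensional features are all assumptions made for illustration. In practice the features would come from the CLIP visual and text encoders.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def visual_prototypes(support):
    """ProtoNet-style prototypes: the mean support feature of each class."""
    return {label: mean_vector(feats) for label, feats in support.items()}


def fuse_prototypes(visual, text, alpha=0.5):
    """Hypothetical fusion (an assumption, not the paper's MPE module):
    a convex combination of the visual and text prototype of each class."""
    return {
        label: [alpha * v + (1.0 - alpha) * t
                for v, t in zip(visual[label], text[label])]
        for label in visual
    }


def classify(query, prototypes):
    """Assign the query to the class whose prototype is most similar."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy 2-way 2-shot episode with made-up 2-D features.
support = {
    "run": [[1.0, 0.0], [0.8, 0.2]],
    "jump": [[0.0, 1.0], [0.2, 0.8]],
}
# Stand-ins for label-text features (illustrative values only).
text = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}

visual = visual_prototypes(support)
fused = fuse_prototypes(visual, text, alpha=0.5)
prediction = classify([0.7, 0.3], fused)
```

In this toy episode the text vectors pull each fused prototype closer to its class axis than the visual mean alone, which is the intuition behind enhancing prototypes with label semantics.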