Current methods for few-shot action recognition mainly follow the metric learning framework of ProtoNet, which demonstrates the importance of prototypes. Although they achieve relatively good performance, they ignore the effect of multimodal information such as label texts. In this work, we propose a novel MultimOdal PRototype-ENhanced Network (MORN), which uses the semantic information of label texts as multimodal information to enhance prototypes. A CLIP visual encoder and a frozen CLIP text encoder are introduced to obtain features with good multimodal initialization. In the visual flow, visual prototypes are computed by a visual prototype-computed module. In the text flow, a semantic-enhanced (SE) module and an inflating operation are used to obtain text prototypes. The final multimodal prototypes are then computed by a multimodal prototype-enhanced (MPE) module. In addition, we define a PRototype SImilarity DiffErence (PRIDE) metric to evaluate the quality of prototypes, and use it to verify both the improvement at the prototype level and the effectiveness of MORN. We conduct extensive experiments on four popular few-shot action recognition datasets, HMDB51, UCF101, Kinetics and SSv2, on which MORN achieves state-of-the-art results. Plugging PRIDE into the training stage further improves performance.
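To make the prototype-based setting concrete, the following is a minimal sketch of ProtoNet-style classification with a text-enhanced prototype. It is not the MORN implementation: the function names are hypothetical, the features are assumed to be precomputed vectors (for example, from CLIP encoders), and `enhance_with_text` is a simple convex combination used only as a stand-in for the MPE module, whose actual design is not specified in this abstract.

```python
import numpy as np


def compute_prototypes(support_feats, support_labels, n_way):
    """ProtoNet-style prototypes: the mean support feature of each class.

    support_feats: (n_support, dim) array of precomputed features.
    support_labels: (n_support,) integer labels in [0, n_way).
    """
    return np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in range(n_way)]
    )


def enhance_with_text(visual_protos, text_feats, alpha=0.5):
    """Hypothetical fusion (an assumption, NOT the paper's MPE module):
    a convex combination of visual prototypes and per-class text features.
    Both inputs have shape (n_way, dim).
    """
    return alpha * visual_protos + (1.0 - alpha) * text_feats


def classify(query_feats, prototypes):
    """Assign each query to its nearest prototype (squared Euclidean distance)."""
    diffs = query_feats[:, None, :] - prototypes[None, :, :]  # (n_query, n_way, dim)
    dists = (diffs ** 2).sum(axis=-1)                         # (n_query, n_way)
    return dists.argmin(axis=1)
```

In a 5-way 1-shot episode, for instance, `compute_prototypes` would return five vectors, `enhance_with_text` would shift each toward the feature of its label text, and `classify` would label every query video by its closest enhanced prototype.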