MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

Recent few-shot action recognition (FSAR) methods achieve promising performance by performing semantic matching on learned discriminative features. However, most FSAR methods focus on feature alignment at a single scale (e.g., frame level or segment level), which ignores the fact that human actions with the same semantics may appear at different velocities. To address this, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework that progressively learns and aligns semantically related action features at multiple velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module measures the similarity between features from support and query videos at different velocity scales and then merges all similarity scores in a residual fashion. To keep the multi-velocity features from deviating from the underlying motion semantics, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video features via feature interaction in the channel and temporal domains at different velocities. The two modules complement each other to predict query categories more accurately in few-shot settings. Experimental results show that our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (i.e., HMDB51, UCF101, Kinetics, and SSv2-small).
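The multi-velocity matching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how velocity scales are built or how the residual merge is weighted, so the average pooling in `pool_by_velocity`, the strides `(1, 2, 4)`, and the weight `alpha` are all assumptions chosen for illustration, and the text-injection step of PSTI is omitted entirely.

```python
import math

def pool_by_velocity(frames, stride):
    """Average each run of `stride` consecutive frame vectors.
    Assumption: a coarser stride stands in for a faster-velocity view."""
    pooled = []
    for start in range(0, len(frames) - stride + 1, stride):
        window = frames[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / stride for d in range(dim)])
    return pooled

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def aligned_similarity(support, query):
    """Mean cosine similarity over temporally aligned positions."""
    pairs = list(zip(support, query))
    if not pairs:
        return 0.0
    return sum(cosine(s, q) for s, q in pairs) / len(pairs)

def multi_velocity_score(support, query, strides=(1, 2, 4), alpha=0.5):
    """Residual merge (assumed form): start from the finest scale and add
    each coarser scale's similarity, scaled by the hypothetical weight alpha."""
    score = aligned_similarity(pool_by_velocity(support, strides[0]),
                               pool_by_velocity(query, strides[0]))
    for stride in strides[1:]:
        score += alpha * aligned_similarity(pool_by_velocity(support, stride),
                                            pool_by_velocity(query, stride))
    return score
```

In a few-shot episode, a query video would be assigned to the support class with the highest merged score. A learned model would replace the raw frame vectors with backbone features and the fixed `alpha` with trained parameters.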

Related content

Few-shot learning

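Few-shot learning (FSL) addresses the problem of improving neural network performance when only a small amount of data is available. A family of methods commonly used in FSL is known as meta-learning. Like ordinary neural network training, meta-learning has a training phase and a testing phase, but these are called meta-training and meta-testing, respectively.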
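Meta-training and meta-testing are commonly organized into episodes, each consisting of a small labeled support set and a query set to classify. The sketch below shows one way such an N-way K-shot episode might be sampled. The function name `sample_episode` and the dictionary layout are hypothetical, introduced only for this example.

```python
import random

def sample_episode(labels_to_items, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode: a support set and a query set.

    labels_to_items maps each class label to a list of its examples.
    Each chosen class contributes k_shot support and n_query query items.
    """
    classes = rng.sample(sorted(labels_to_items), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never overlap.
        items = rng.sample(labels_to_items[label], k_shot + n_query)
        support += [(item, label) for item in items[:k_shot]]
        query += [(item, label) for item in items[k_shot:]]
    return support, query
```

In the usual setup, meta-training episodes are drawn from one set of classes and meta-testing episodes from a disjoint set, so the model is evaluated on classes it has never seen.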
