Recent few-shot action recognition (FSAR) methods achieve promising performance by performing semantic matching on learned discriminative features. However, most FSAR methods align features at a single scale (e.g., frame-level or segment-level), which ignores the fact that human actions with the same semantics may occur at different velocities. To this end, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework that progressively learns and aligns semantically related action features at multiple velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module measures the similarity between support and query video features at different velocity scales and then merges all similarity scores in a residual fashion. To prevent the multi-velocity features from deviating from the underlying motion semantics, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video features through feature interaction in the channel and temporal domains at different velocities. The two modules complement each other to predict query categories more accurately in few-shot settings. Experimental results show that our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (i.e., HMDB51, UCF101, Kinetics, and SSv2-small).
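To make the multi-velocity matching idea concrete, the following Python sketch compares a support video and a query video at several temporal strides and accumulates the per-velocity scores. It is an illustration under stated assumptions, not the MVFA module itself: the abstract does not specify how velocity-level features are built or how the residual merge is defined, so temporal average pooling, frame-wise cosine similarity, and a running-sum merge are stand-ins chosen here, and the names `pool_velocity`, `cosine_score`, and `multi_velocity_score` are hypothetical.

```python
import numpy as np


def pool_velocity(frames: np.ndarray, step: int) -> np.ndarray:
    """Average every `step` consecutive frames: (T, D) -> (T // step, D).

    Assumption: a coarser "velocity" view is approximated by temporal
    average pooling; trailing frames that do not fill a window are dropped.
    """
    t, d = frames.shape
    usable = (t // step) * step
    return frames[:usable].reshape(t // step, step, d).mean(axis=1)


def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Mean cosine similarity between temporally aligned rows of a and b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return float((a * b).sum(axis=1).mean())


def multi_velocity_score(support: np.ndarray, query: np.ndarray,
                         steps=(1, 2, 4)) -> float:
    """Accumulate similarity over velocity levels, then normalize.

    Assumption: the "residual" merge is modeled as adding each level's
    score onto the running score from the previous levels.
    """
    score = 0.0
    for step in steps:
        level = cosine_score(pool_velocity(support, step),
                             pool_velocity(query, step))
        score = score + level  # running sum over velocity levels
    return score / len(steps)
```

In a few-shot episode, a query would be assigned to the support class with the highest `multi_velocity_score`. Both inputs are assumed to be frame-level feature arrays of the same shape `(T, D)`, with `T` at least as large as the largest stride.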