Current state-of-the-art approaches to few-shot action recognition achieve promising performance by performing frame-level matching on learned visual features. However, they generally suffer from two limitations: i) matching between local frames tends to be inaccurate because nothing guides the features toward long-range temporal perception; ii) explicit motion learning is usually ignored, leading to partial information loss. To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method with two crucial components: a long-short contrastive objective and a motion autodecoder. Specifically, the long-short contrastive objective endows local frame features with long-form temporal awareness by maximizing their agreement with the global token of videos belonging to the same class. The motion autodecoder is a lightweight architecture that reconstructs pixel motions from differential features, explicitly embedding motion dynamics into the network. In this way, MoLo simultaneously learns long-range temporal context and motion cues for comprehensive few-shot matching. To demonstrate its effectiveness, we evaluate MoLo on five standard benchmarks, and the results show that MoLo outperforms recent advanced methods. The source code is available at https://github.com/alibaba-mmai-research/MoLo.
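The long-short contrastive objective can be illustrated with a small sketch. The abstract does not give the exact loss, so the code below assumes an InfoNCE-style formulation with cosine similarity. The function names, the temperature value, and the averaging over same-class global tokens are hypothetical choices made for illustration, not the released implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def long_short_contrastive_loss(frame_feats, global_tokens, labels,
                                frame_label, temperature=0.07):
    """InfoNCE-style loss for the frames of one video (illustrative only).

    frame_feats:   per-frame feature vectors (the "short" view)
    global_tokens: one global token per video in the episode (the "long" view)
    labels:        class label of each global token
    frame_label:   class label of the video the frames come from
    temperature:   assumed softmax temperature, not taken from the paper
    """
    losses = []
    for feat in frame_feats:
        logits = [cosine(feat, tok) / temperature for tok in global_tokens]
        # Numerically stable log-sum-exp over all global tokens.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Positives are global tokens of videos in the same class.
        positives = [x for x, y in zip(logits, labels) if y == frame_label]
        if not positives:
            raise ValueError("no global token shares the frame's class")
        losses.append(-sum(x - log_denom for x in positives) / len(positives))
    return sum(losses) / len(losses)


# Toy episode: two global tokens, one per class, and two frames of one video.
frames = [[1.0, 0.1], [0.9, 0.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
token_labels = [0, 1]

loss_same_class = long_short_contrastive_loss(frames, tokens, token_labels, 0)
loss_other_class = long_short_contrastive_loss(frames, tokens, token_labels, 1)
```

In this toy episode the frames point toward the class-0 global token, so the loss is small when they are labelled class 0 and large when they are labelled class 1. Minimizing such a loss pulls each local frame feature toward the global tokens of its own class.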