Few-shot action recognition, i.e., recognizing new action classes given only a few examples, benefits from incorporating temporal information. Prior work either encodes such information in the representation itself and learns classifiers at test time, or obtains frame-level features and performs pairwise temporal matching. We first evaluate a number of matching-based approaches using features from spatio-temporal backbones, a comparison missing from the literature, and show that with such features the performance gap between simple baselines and more complicated methods shrinks significantly. Inspired by this, we propose Chamfer++, a non-temporal matching function that achieves state-of-the-art results in few-shot action recognition. We show that, when starting from temporal features, our parameter-free and interpretable approach can outperform all other matching-based and classifier-based methods for one-shot action recognition on three common datasets, without using temporal information in the matching stage. Project page: https://jbertrand89.github.io/matching-based-fsar
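To make the idea of a non-temporal, parameter-free matching function concrete, the following is a minimal sketch of a generic Chamfer-style similarity between two sets of clip features. It is not the Chamfer++ function itself, whose details are not given here; the function names, the use of cosine similarity, and the one-directional query-to-support averaging are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def chamfer_similarity(query, support):
    """Generic Chamfer-style similarity (illustrative, not Chamfer++).

    query:   (n, d) array of features for the query video
    support: (m, d) array of features for one support video

    Each query feature is matched to its most similar support feature,
    and the best-match cosine similarities are averaged. Feature order
    is never used, so the score is invariant to temporal permutation.
    """
    sim = l2_normalize(query) @ l2_normalize(support).T  # (n, m) cosine matrix
    return float(sim.max(axis=1).mean())


def classify_one_shot(query, supports):
    """Pick the class whose single support video scores highest.

    supports: dict mapping class label -> (m, d) feature array.
    Returns the predicted label and the per-class scores.
    """
    scores = {label: chamfer_similarity(query, feats)
              for label, feats in supports.items()}
    return max(scores, key=scores.get), scores
```

Because the score depends only on best matches between the two feature sets, shuffling the temporal order of either video leaves it unchanged; any temporal information has to come from the features themselves, for example from a spatio-temporal backbone.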