Deep learning has achieved great success in video recognition, yet it still struggles to recognize novel actions from only a few examples. To tackle this challenge, few-shot action recognition methods transfer knowledge from a source dataset to a novel target dataset that has only one or a few labeled videos. However, existing methods mainly model the temporal relations between query and support videos and ignore the spatial relations. In this paper, we find that spatial misalignment between objects also occurs in videos and is notably more common than temporal inconsistency. We are thus motivated to investigate the importance of spatial relations and propose a more accurate few-shot action recognition method that leverages both spatial and temporal information. In particular, we contribute a novel Spatial Alignment Cross Transformer (SA-CT), which learns to re-adjust the spatial relations and incorporates temporal information. Experiments reveal that, even without using any temporal information, SA-CT performs comparably to temporal-based methods on 3 of 4 benchmarks. To further incorporate temporal information, we propose a simple yet effective Temporal Mixer module. The Temporal Mixer enhances the video representation and improves the performance of the full SA-CT model, achieving very competitive results. We also exploit large-scale pretrained models for few-shot action recognition, providing useful insights for this research direction.