SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

High frame-rate (HFR) videos improve the fine-grained expression of actions for recognition, but they reduce the density of spatio-temporal relation and motion information. As a result, traditional data-driven training continuously requires large amounts of video samples. However, sufficient samples are not always available in real-world scenarios, which motivates research on few-shot action recognition (FSAR). We observe that most recent FSAR works build the spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, which separates the spatial and temporal features within samples. They also capture motion information from a narrow perspective between adjacent frames without considering density, leading to insufficient motion information capture. We therefore propose a novel plug-and-play architecture for FSAR called the Spatio-tempOral frAme tuPle enhancer (SOAP). We refer to the model designed with this architecture as SOAP-Net. Instead of simple feature extraction, SOAP-Net considers temporal connections between different feature channels and the spatio-temporal relation of features. It also captures comprehensive motion information using frame tuples of multiple frames, which contain more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.
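The frame-tuple idea can be illustrated with a minimal sketch. It is not the SOAP-Net implementation (see the released repository for that). The function names `frame_tuples`, `tuple_motion`, and `multi_scale_motion`, the stride-1 sliding window, and the use of mean absolute frame differences as a motion descriptor are all assumptions made for illustration. The sketch only shows how tuples of several frame counts might be formed and combined, with a tuple size of 2 corresponding to the adjacent-frame case.

```python
import numpy as np


def frame_tuples(num_frames, tuple_size):
    """Index tuples of `tuple_size` consecutive frames (assumed stride-1 window)."""
    if not 2 <= tuple_size <= num_frames:
        raise ValueError("tuple_size must be between 2 and num_frames")
    return [tuple(range(s, s + tuple_size))
            for s in range(num_frames - tuple_size + 1)]


def tuple_motion(features, tuple_size):
    """Hypothetical motion descriptor per frame tuple.

    features: array of shape (T, D), one D-dimensional feature per frame.
    Returns an array of shape (T - tuple_size + 1, D): for each tuple, the
    mean absolute difference between consecutive frames inside that tuple.
    """
    diffs = np.abs(np.diff(features, axis=0))  # shape (T - 1, D)
    tuples = frame_tuples(features.shape[0], tuple_size)
    # A tuple starting at frame s covers the differences s .. s + tuple_size - 2.
    return np.stack([diffs[t[0]:t[-1]].mean(axis=0) for t in tuples])


def multi_scale_motion(features, tuple_sizes=(2, 3, 4)):
    """Concatenate the averaged motion descriptors of several tuple sizes."""
    return np.concatenate(
        [tuple_motion(features, k).mean(axis=0) for k in tuple_sizes]
    )


if __name__ == "__main__":
    video = np.random.default_rng(0).normal(size=(8, 4))  # 8 frames, 4-dim features
    print(frame_tuples(8, 3)[:2])
    print(multi_scale_motion(video).shape)
```

In this sketch, larger tuples summarize motion over longer spans of frames, and concatenating several tuple sizes is one simple way to combine the different perspectives.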
