Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

In few-shot learning, the main difference between the image and video settings is the additional temporal dimension of video. In recent years, many few-shot action recognition approaches have been metric-based; in particular, some use a Transformer to obtain cross-attention features of the videos or enhanced prototypes, with competitive results. However, these methods do not extract enough information from the Transformer because they focus on features at a single level only. To address this problem, we propose an end-to-end method named "Task-Specific Alignment and Multiple Level Transformer Network (TSA-MLT)". In our model, the Multiple Level Transformer attends to multiple-level features of the support and query videos. Before the Multiple Level Transformer, we apply Task-Specific Alignment (TSA) as a pre-processing step to filter out unimportant or misleading frames. Furthermore, we adopt a fusion loss built on two kinds of distance. The first is the L2 sequence distance, which focuses on temporal order alignment. The second is the optimal transport distance, which measures the gap between the appearance and semantics of the videos. A simple fusion network fuses the two distances element-wise, and we apply the cross-entropy loss to the fused result as our fusion loss. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and competitive results on the Kinetics and Something-Something V2 benchmarks. Our code will be available at https://github.com/cofly2014/tsa-mlt.git
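
The two distances and their fusion can be illustrated with a minimal sketch in plain Python. This is not the TSA-MLT implementation: the function names, the toy frame features, the Sinkhorn settings (`eps`, `n_iters`), and the fixed weight `alpha` are all assumptions made for illustration. In particular, the weighted sum only stands in for the paper's learned fusion network, and the optimal transport distance is approximated here with entropy-regularised Sinkhorn iterations using uniform frame weights.

```python
import math


def sq_l2(u, v):
    """Squared Euclidean distance between two frame feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def l2_sequence_distance(query, support):
    """Mean squared L2 distance between frames at the same temporal index.

    Frame t of the query is compared only with frame t of the support video,
    so the value is sensitive to temporal order.
    """
    return sum(sq_l2(q, s) for q, s in zip(query, support)) / len(query)


def sinkhorn_distance(query, support, eps=0.5, n_iters=200):
    """Entropy-regularised optimal transport cost (assumed uniform weights).

    Any query frame may be matched to any support frame, so the value
    ignores temporal order and reflects appearance similarity only.
    Note: a very small eps with large costs can underflow exp() to zero.
    """
    n, m = len(query), len(support)
    cost = [[sq_l2(q, s) for s in support] for q in query]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan entry is u[i] * kernel[i][j] * v[j]; return its total cost.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))


def fused_distances(query, class_videos, alpha=0.5):
    """Weighted sum of both distances per class (stand-in for a fusion network)."""
    return [alpha * l2_sequence_distance(query, s)
            + (1.0 - alpha) * sinkhorn_distance(query, s)
            for s in class_videos]


def cross_entropy(distances, label):
    """Cross-entropy with negative distances as logits (log-sum-exp form)."""
    logits = [-d for d in distances]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[label]


# Toy episode: two classes, three frames per video, 2-D frame features.
class_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
class_b = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
query = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]]  # resembles class_a, in order

distances = fused_distances(query, [class_a, class_b])
predicted = distances.index(min(distances))
loss = cross_entropy(distances, label=0)
print(predicted, loss)
```

In this toy episode, the frames of `class_b` partly resemble the query frames but appear in a different order, so its optimal transport distance to the query is smaller than its L2 sequence distance. This is the complementary behaviour that motivates combining an order-sensitive distance with an order-agnostic one.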
