In few-shot learning, the main difference between image-based and video-based methods is the additional temporal dimension. In recent years, several works have applied the Transformer to frame sequences to obtain attention features and enhanced prototypes, with competitive results. However, some video frames may relate only weakly to the action, and using only frame-level or only segment-level features may not mine enough information. We address these two problems in turn with an end-to-end method named "Task-Specific Alignment and Multiple-level Transformer Network (TSA-MLT)". The first module (TSA) filters out action-irrelevant frames to align action durations; it applies an affine transformation to the frame sequence along the time dimension for linear sampling. The second module (MLT) operates on features at multiple levels of the support prototypes and query samples to mine more information for the alignment. We adopt a fusion loss based on a fusion distance that combines the L2 sequence distance, which focuses on temporal order alignment, with the Optimal Transport distance, which measures the gap between the appearance and semantics of the videos. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and competitive results on the Kinetics and Something-Something V2 benchmarks. Our code is available at https://github.com/cofly2014/tsa-mlt.git
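The two ideas named above, affine sampling along the time axis and a distance that fuses an order-sensitive term with an Optimal Transport term, can be illustrated with a minimal NumPy sketch. This is not the released TSA-MLT code. The function names, the `scale` and `shift` parameters, the cosine cost, the entropic (Sinkhorn) solver, and the `weight` used for fusion are all assumptions chosen for illustration; in the actual method the affine parameters would presumably be predicted by a learned network.

```python
import numpy as np

def temporal_affine_sample(frames, scale, shift, num_out=None):
    """Resample a (T, D) frame-feature sequence along time.

    Output positions live in [-1, 1]; each is mapped to a source position
    scale * t + shift and read with linear interpolation. `scale` and
    `shift` are hypothetical stand-ins for learned affine parameters.
    """
    frames = np.asarray(frames, dtype=float)
    T = frames.shape[0]
    num_out = T if num_out is None else num_out
    grid = np.linspace(-1.0, 1.0, num_out)
    src = np.clip(scale * grid + shift, -1.0, 1.0)
    pos = (src + 1.0) * 0.5 * (T - 1)          # fractional frame index
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (pos - lo)[:, None]                     # interpolation weight
    return (1.0 - w) * frames[lo] + w * frames[hi]

def l2_sequence_distance(x, y):
    """Mean squared L2 distance between frames at the same time index."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))

def cosine_cost(x, y):
    """Pairwise cosine distance between all frames of x and all frames of y."""
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    yn = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-8)
    return 1.0 - xn @ yn.T

def sinkhorn_distance(cost, eps=0.1, n_iters=200):
    """Entropy-regularised OT cost with uniform marginals (assumed solver)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return float(np.sum(plan * cost))

def fused_distance(x, y, weight=0.5):
    """Weighted sum of the order-sensitive and the OT term (assumed form)."""
    ot = sinkhorn_distance(cosine_cost(x, y))
    return weight * l2_sequence_distance(x, y) + (1.0 - weight) * ot
```

With `scale=1` and `shift=0` the sampler returns the input unchanged, while `scale < 1` zooms into a sub-interval of the clip, which is one way to drop leading and trailing frames. Reversing a sequence leaves the OT term almost unchanged but raises the L2 sequence term, which is why the two terms are complementary.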