Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotation. Existing approaches mainly use convolutional neural networks, while recent vision transformer models remain less explored. In this paper, we investigate the use of transformer models under the semi-supervised learning (SSL) setting for action recognition. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown to be effective for semi-supervised image classification, they generally produce limited gains for video recognition. We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored for video data, in which video clips are mixed via a mask whose masked tokens are consistent over the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets (Kinetics-400, UCF-101, and HMDB-51) verify the advantage of SVFormer. In particular, SVFormer outperforms the state of the art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. We hope our method can serve as a strong benchmark and encourage future research on semi-supervised action recognition with Transformer networks.
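To make the three components named in the abstract more concrete, the following is a minimal sketch in plain Python, not the SVFormer implementation. It illustrates (a) a mask that is sampled once over spatial token positions and repeated for every frame, so the masked tokens form "tubes" along the temporal axis, (b) mixing two clips token by token with that mask, (c) stretching a subset of frames to random durations while keeping the clip length fixed, and (d) an exponential moving average update of teacher weights from student weights. All function names, the list-based token and weight representation, the way durations are sampled, and the default momentum of 0.99 are assumptions made for illustration; the abstract does not specify them.

```python
import random


def tube_mask(num_frames, num_tokens, mask_ratio, rng):
    """Sample one spatial token mask and repeat it for every frame (a 'tube')."""
    num_masked = round(mask_ratio * num_tokens)
    masked = set(rng.sample(range(num_tokens), num_masked))
    frame_mask = [1 if i in masked else 0 for i in range(num_tokens)]
    # The same per-frame mask is copied along the temporal axis.
    return [list(frame_mask) for _ in range(num_frames)]


def tube_token_mix(tokens_a, tokens_b, mask):
    """Take a token from clip A where the mask is 1, otherwise from clip B."""
    return [
        [a if m else b for a, b, m in zip(frame_a, frame_b, frame_m)]
        for frame_a, frame_b, frame_m in zip(tokens_a, tokens_b, mask)
    ]


def temporal_warp(frames, num_selected, rng):
    """Keep `num_selected` frames and stretch each over a random duration.

    The output has the same length as the input (assumed sampling scheme).
    """
    total = len(frames)
    keep = sorted(rng.sample(range(total), num_selected))
    # Distinct cut points split the clip into num_selected positive durations.
    cuts = sorted(rng.sample(range(1, total), num_selected - 1))
    durations = [end - start for start, end in zip([0] + cuts, cuts + [total])]
    warped = []
    for index, duration in zip(keep, durations):
        warped.extend([frames[index]] * duration)
    return warped


def ema_update(teacher, student, momentum=0.99):
    """Move teacher weights toward student weights (momentum is an assumed value)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two toy clips: 4 frames, 6 tokens per frame, tokens are just labels here.
    clip_a = [["a"] * 6 for _ in range(4)]
    clip_b = [["b"] * 6 for _ in range(4)]
    mask = tube_mask(num_frames=4, num_tokens=6, mask_ratio=0.5, rng=rng)
    for frame in tube_token_mix(clip_a, clip_b, mask):
        print(frame)  # every frame takes tokens from clip A at the same positions
    print(temporal_warp(list(range(8)), num_selected=3, rng=rng))
    print(ema_update([1.0, 2.0], [0.0, 0.0], momentum=0.9))
```

In a real pipeline the tokens would be patch embeddings and the weights would be model parameters; lists are used here only so the sketch runs without external dependencies. How the pseudo-labels of the two mixed clips are combined is not stated in the abstract and is therefore left out.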