ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

Human action or activity recognition in videos is a fundamental task in computer vision, with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction, and more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach that uses Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition. Our framework learns action representations in videos from both labeled and unlabeled data, combining pseudo-labeling with contrastive learning so that both types of samples contribute to training. We introduce a cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while VIT excels at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach captures both the local and the global contextual information of an action. Drawing on the strengths of each architecture in this way yields better performance in semi-supervised action recognition tasks. Experimental results on standard action recognition datasets demonstrate that our approach performs better than existing methods, achieving state-of-the-art performance with only a fraction of labeled data. The official website of this work is available at: https://github.com/rana2149/ActNetFormer.
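To make the cross-architecture pseudo-labeling idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a setup the abstract does not specify: each unlabeled clip has a weakly and a strongly augmented view, a pseudo-label is kept only when its confidence reaches a threshold (0.95 here, a hypothetical value), and each model is trained on the pseudo-labels produced by the other. The function names and the choice of cross-entropy are illustrative assumptions, and the contrastive term is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(probs, threshold):
    """Return (class_index, keep); keep is True if the top probability >= threshold."""
    confidence = max(probs)
    return probs.index(confidence), confidence >= threshold


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class, clamped to avoid log(0)."""
    return -math.log(max(probs[target], 1e-12))


def cross_architecture_loss(cnn_weak, vit_weak, cnn_strong, vit_strong,
                            threshold=0.95):
    """Unsupervised loss for a batch of unlabeled clips (illustrative only).

    Each argument is a list of per-clip logit lists. The transformer's
    confident predictions on weak views supervise the 3D CNN on strong
    views, and vice versa. The threshold is an assumed hyperparameter.
    """
    n = len(cnn_weak)
    if n == 0:
        return 0.0, 0.0
    cnn_loss, vit_loss = 0.0, 0.0
    for cw, vw, cs, vs in zip(cnn_weak, vit_weak, cnn_strong, vit_strong):
        vit_label, vit_keep = pseudo_label(softmax(vw), threshold)
        cnn_label, cnn_keep = pseudo_label(softmax(cw), threshold)
        if vit_keep:  # transformer teaches the CNN
            cnn_loss += cross_entropy(softmax(cs), vit_label)
        if cnn_keep:  # CNN teaches the transformer
            vit_loss += cross_entropy(softmax(vs), cnn_label)
    # Average over the whole batch, so low-confidence clips contribute zero.
    return cnn_loss / n, vit_loss / n
```

In a full training loop these two terms would presumably be added to a supervised loss on the labeled clips and to a contrastive term between the two models' features; how those terms are weighted is not stated in the abstract.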
