This paper studies how knowledge distillation, combined with CNN and ViT models, can improve human action recognition. The goal is to raise the performance and efficiency of smaller student models by transferring knowledge from larger teacher models. The proposed method uses a Vision Transformer as the student model and a convolutional network as the teacher model: the teacher extracts local image features, whereas the student captures global features through an attention mechanism. The Vision Transformer (ViT) architecture is introduced as a robust framework for capturing global dependencies in images, and its advanced variants PVT, ConViT, MViT, Swin Transformer, and Twins are discussed along with their contributions to computer vision tasks. ConvNeXt, known for its efficiency and effectiveness in computer vision, serves as the teacher model. Results for human action recognition on the Stanford 40 dataset compare the accuracy and mAP of student models trained with and without knowledge distillation. The proposed approach significantly improves both accuracy and mAP over training the same networks under regular settings, which highlights the potential of combining local and global features in action recognition tasks.
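To make the teacher-to-student transfer concrete, the following is a minimal sketch of a soft-target distillation loss for a single image. It assumes the common formulation in which a hard-label cross-entropy term is mixed with a temperature-scaled KL divergence between teacher and student outputs; the abstract does not state which distillation objective is used, so the function names, `alpha`, and `temperature` are illustrative assumptions rather than the authors' settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=3.0):
    """Hypothetical soft-target distillation loss for one sample.

    alpha and temperature are assumed hyperparameters, not values
    taken from the paper. Logits are assumed to be moderate in size,
    so no probability underflows to zero.
    """
    # Hard-label term: cross-entropy of the student (e.g. a ViT)
    # against the ground-truth action class.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[label])

    # Soft-label term: KL(teacher || student) on temperature-softened
    # distributions, with the teacher (e.g. a CNN) held fixed.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(teacher_soft, student_soft))

    # The T^2 factor keeps the soft-term gradients on a scale
    # comparable to the hard-label term.
    return (1.0 - alpha) * cross_entropy + alpha * temperature ** 2 * kl
```

With `alpha=0` this reduces to ordinary supervised training, which corresponds to the "without knowledge distillation" baseline; larger values of `alpha` weight the teacher's softened predictions more heavily.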