Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition

This paper presents a study on improving human action recognition through the utilization of knowledge distillation, and the combination of CNN and ViT models. The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models. The proposed method employs a Transformer vision network as the student model, while a convolutional network serves as the teacher model. The teacher model extracts local image features, whereas the student model focuses on global features using an attention mechanism. The Vision Transformer (ViT) architecture is introduced as a robust framework for capturing global dependencies in images. Additionally, advanced variants of ViT, namely PVT, Convit, MVIT, Swin Transformer, and Twins, are discussed, highlighting their contributions to computer vision tasks. The ConvNeXt model is introduced as a teacher model, known for its efficiency and effectiveness in computer vision. The paper presents performance results for human action recognition on the Stanford 40 dataset, comparing the accuracy and mAP of student models trained with and without knowledge distillation. The findings illustrate that the suggested approach significantly improves the accuracy and mAP when compared to training networks under regular settings. These findings emphasize the potential of combining local and global features in action recognition tasks.

翻译：本文通过知识蒸馏方法及CNN与ViT模型的结合，对人体动作识别技术进行改进研究。研究旨在将大型教师模型的知识迁移至小型学生模型，从而提升后者性能与效率。所提方法采用Transformer视觉网络作为学生模型，卷积网络作为教师模型。教师模型负责提取局部图像特征，而学生模型通过注意力机制聚焦全局特征。Vision Transformer (ViT)架构被引入作为捕捉图像全局依赖关系的稳健框架，同时探讨了PVT、Convit、MVIT、Swin Transformer及Twins等ViT先进变体在计算机视觉任务中的贡献。ConvNeXt模型作为教师模型被引入，该模型以高效性和有效性著称。文章基于Stanford 40数据集展示了人体动作识别的性能结果，对比了有无知识蒸馏训练下学生模型的准确率与平均精度均值(mAP)。实验结果表明，相较于常规训练方法，所提方法显著提升了准确率与mAP。这些发现凸显了在动作识别任务中融合局部与全局特征的潜力。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日