Human action recognition (HAR) is a high-level and important research area in computer vision with wide-ranging applications. The main limitations of current HAR models are their complex structures and lengthy training times. In this paper, we propose TransNet, a simple yet versatile and effective end-to-end deep learning architecture for HAR. TransNet decomposes complex 3D-CNNs into 2D- and 1D-CNNs, where the 2D-CNN component extracts spatial features and the 1D-CNN component extracts temporal patterns from videos. Thanks to this concise architecture, TransNet is compatible with any pretrained state-of-the-art 2D-CNN model from other fields, which can be transferred to serve the HAR task. In other words, it naturally leverages transfer learning for HAR, yielding gains in both efficiency and effectiveness. Extensive experimental results and comparisons with state-of-the-art models demonstrate the superior performance of the proposed TransNet in HAR in terms of flexibility, model complexity, training speed, and classification accuracy.
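The 2D-then-1D decomposition described above can be illustrated with a minimal, dependency-free sketch. It is not the TransNet implementation: `spatial_features` is a hypothetical stand-in for a pretrained 2D-CNN backbone, and `temporal_conv1d`, `video_descriptor`, and the toy kernel are names and values assumed purely for illustration. The sketch only shows the data flow, in which each frame is encoded independently and a 1D convolution then runs along the time axis of the resulting feature sequence.

```python
from typing import List, Sequence

Frame = List[List[float]]   # one grayscale frame, H x W
Feature = List[float]       # per-frame feature vector


def spatial_features(frame: Frame) -> Feature:
    """Hypothetical stand-in for a pretrained 2D-CNN: one frame -> feature vector.

    Here the "features" are just the mean and max intensity of the frame.
    """
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def temporal_conv1d(features: Sequence[Feature], kernel: Sequence[float]) -> List[Feature]:
    """Valid 1D cross-correlation along time, applied to each channel independently."""
    k = len(kernel)
    channels = len(features[0])
    out = []
    for start in range(len(features) - k + 1):
        window = features[start:start + k]
        out.append([sum(w * f[c] for w, f in zip(kernel, window)) for c in range(channels)])
    return out


def video_descriptor(video: Sequence[Frame], kernel: Sequence[float]) -> Feature:
    """Spatial step per frame, temporal step across frames, then average over time."""
    per_frame = [spatial_features(frame) for frame in video]   # 2D (spatial) stage
    temporal = temporal_conv1d(per_frame, kernel)              # 1D (temporal) stage
    channels = len(temporal[0])
    return [sum(step[c] for step in temporal) / len(temporal) for c in range(channels)]


# Toy video: four 2x2 frames whose brightness rises by 1 per frame.
toy_video = [[[float(v)] * 2 for _ in range(2)] for v in range(4)]
# A temporal-difference kernel responds to the per-frame change in each feature.
print(video_descriptor(toy_video, kernel=[-1.0, 1.0]))  # [1.0, 1.0]
```

Because the spatial stage sees one frame at a time, it could in principle be replaced by any image model without touching the temporal stage, which is the property the abstract attributes to the architecture.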