Action recognition is a key technology for building interactive metaverses. With the rapid development of deep learning, action recognition methods have advanced considerably. Researchers design and implement backbones from multiple standpoints, which has produced a diversity of methods and raised new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) two-stream networks and their variants, which, in this paper, specifically take RGB video frames and optical flow as input; 2) 3D convolutional networks, which operate directly on the RGB modality so that motion information no longer has to be extracted separately; 3) Transformer-based methods, which bring a model from natural language processing into computer vision and video understanding. We aim to offer an objective perspective in this review and hope it provides a reference for future research.