Video behavior recognition is currently one of the most foundational tasks in computer vision. The 2D neural networks of deep learning are built to recognize pixel-level information, such as images in RGB, RGB-D, or optical flow formats. With the increasingly wide use of surveillance video and the growing number of tasks related to human action recognition, more and more tasks require temporal information to analyze dependencies between frames. Researchers have therefore widely studied video-based recognition, rather than image-based (pixel-based) recognition alone, to extract more informative elements from geometry tasks. This article reviews multiple recently proposed research works and compares the advantages and disadvantages of the deep learning frameworks derived from them, rather than those of machine learning frameworks. The comparison covers existing frameworks and datasets that consist of video data only. Given the specific properties of human actions and the increasingly wide use of deep neural networks, we collected all research works from the last three years, 2020 to 2022. In our article, deep neural networks outperform most other techniques in feature learning and extraction tasks, especially in video action recognition.