Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models

Human detection in videos plays an important role in various real-life applications. Most traditional approaches rely on handcrafted features, which are problem-dependent and optimal only for specific tasks. Moreover, they are highly susceptible to dynamic events such as illumination changes, camera jitter, and variations in object size. In contrast, the proposed feature learning approaches are cheaper and easier to apply because highly abstract and discriminative features can be produced automatically, without the need for expert knowledge. In this paper, we use automatic feature learning methods that combine optical flow with three different deep models (i.e., a supervised convolutional neural network (S-CNN), a pretrained CNN feature extractor, and a hierarchical extreme learning machine (H-ELM)) for human detection in videos captured by a nonstatic camera on an aerial platform at varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The models are compared in terms of training accuracy, testing accuracy, and learning speed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrate that the proposed methods are successful for the human detection task. The pretrained CNN achieves an average accuracy of 98.09%. S-CNN achieves an average accuracy of 95.6% with softmax and 91.7% with support vector machines (SVM). H-ELM achieves an average accuracy of 95.9%. On a standard central processing unit (CPU), H-ELM training takes 445 seconds. Training S-CNN takes 770 seconds on a high-performance graphics processing unit (GPU).

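The abstract reports that H-ELM trains on a CPU in 445 seconds but does not describe its architecture. As a rough illustration of the extreme learning machine idea only, the sketch below trains a basic single-hidden-layer ELM: the hidden weights are drawn at random and left fixed, and only the output weights are solved for in closed form by ridge-regularized least squares. This is a simplified stand-in, not the hierarchical variant and not the authors' implementation. The toy two-cluster data, the function names (`solve_linear`, `train_elm`, `predict`, `make_toy_data`), and the hyperparameters (`n_hidden`, `ridge`, `seed`) are all assumptions made for the example.

```python
import math
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def hidden(x, weights, biases):
    """Random-feature layer: tanh(w . x + b) for each hidden unit."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(weights, biases)]


def train_elm(xs, ys, n_hidden=16, ridge=1e-3, seed=0):
    """Fit a basic ELM; labels in ys are assumed to be -1 or +1."""
    rng = random.Random(seed)
    dim = len(xs[0])
    # Hidden weights are drawn once and never updated.
    weights = [[rng.uniform(-1, 1) for _ in range(dim)]
               for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    h = [hidden(x, weights, biases) for x in xs]
    # Normal equations: (H^T H + ridge * I) beta = H^T y
    gram = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
             for j in range(n_hidden)] for i in range(n_hidden)]
    rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(n_hidden)]
    beta = solve_linear(gram, rhs)
    return weights, biases, beta


def predict(model, x):
    """Return -1 or +1 from the sign of the linear readout."""
    weights, biases, beta = model
    score = sum(b * hj for b, hj in zip(beta, hidden(x, weights, biases)))
    return 1 if score >= 0 else -1


def make_toy_data(n_per_class=30, seed=1):
    """Two well-separated 2-D clusters (hypothetical stand-in for features)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label, centre in ((-1, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            xs.append([centre + rng.uniform(-0.5, 0.5),
                       rng.uniform(-0.5, 0.5)])
            ys.append(label)
    return xs, ys


xs, ys = make_toy_data()
model = train_elm(xs, ys)
accuracy = sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(ys)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

Because the only fitted parameters come from a single linear solve, there is no iterative backpropagation loop; this is one commonly cited reason ELM-style training can be practical without a GPU. The sketch uses only the standard library to stay dependency-free; a real implementation would more likely use a numerical library for the matrix operations.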