Real-time video surveillance through CCTV camera systems has become essential for ensuring public safety, which is a priority today. Although CCTV cameras do much to improve security, these systems require constant human monitoring. To address this issue, intelligent surveillance systems can be built with deep learning video classification techniques that automate the detection of violence as it happens. In this research, we explore such techniques for detecting violence in real time. Traditional image classification techniques fall short on video because they classify each frame separately, which causes the predictions to flicker from frame to frame. Many researchers have therefore proposed video classification techniques that take spatiotemporal features into account. However, deep learning models that rely on inputs such as skeleton points obtained through pose estimation or optical flow obtained through depth sensors are not always practical to deploy in an IoT environment: although these techniques achieve higher accuracy, they are computationally heavier. Keeping these constraints in mind, we experimented with various video classification and action recognition techniques: ConvLSTM, LRCN (with both custom CNN layers and VGG-16 as the feature extractor), CNN-Transformer, and C3D. We achieved a test accuracy of 80% with ConvLSTM, 83.33% with CNN-BiLSTM, 70% with VGG16-BiLSTM, 76.76% with CNN-Transformer, and 80% with C3D.
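
The flicker that arises when frames are classified independently can be illustrated with a small sketch. The per-frame scores below are made up for illustration, and the rolling average is only a simple stand-in for temporal context; it is not one of the models evaluated in this work, which learn spatiotemporal features directly. The function names, the window size, and the threshold are all assumptions.

```python
from collections import deque


def per_frame_labels(probs, threshold=0.5):
    """Label each frame independently from its own violence probability."""
    return [p >= threshold for p in probs]


def rolling_average_labels(probs, window=5, threshold=0.5):
    """Label each frame from the mean probability of the last `window` frames."""
    recent = deque(maxlen=window)  # keeps only the most recent scores
    labels = []
    for p in probs:
        recent.append(p)
        labels.append(sum(recent) / len(recent) >= threshold)
    return labels


def count_flips(labels):
    """Count how often the label changes between consecutive frames."""
    return sum(a != b for a, b in zip(labels, labels[1:]))


# Hypothetical per-frame scores that hover around the decision threshold.
frame_probs = [0.40, 0.62, 0.45, 0.65, 0.45, 0.70, 0.48, 0.72, 0.46, 0.75]

raw = per_frame_labels(frame_probs)
smoothed = rolling_average_labels(frame_probs)

print("label flips, frame by frame:", count_flips(raw))
print("label flips, rolling average:", count_flips(smoothed))
```

On these scores the frame-by-frame labels change far more often than the averaged ones, which is the instability that motivates models that consider several frames jointly.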