In connected smart cities around the world, CCTV plays a pivotal role in safeguarding citizens by recording unlawful activity so that authorities can take action. To make CCTV more effective in this domain, researchers and developers have built various DNN architectures that either detect violence or detect weapons using bounding boxes or masks. The weapons covered are limited to guns, knives, and other obvious handheld weapons. To remove this limitation and detect weapons more efficiently, CCTV footage of non-weaponized violence must be distinguishable from footage of weaponized violence. Since no existing dataset is tailored to generalizable weaponized violence detection, we introduce a new dataset containing videos of weaponized violence, non-weaponized violence, and non-violent events. We also propose a novel data-centric method that arranges video frames into salient images while minimizing information loss, so that SOTA image classifiers can run inference on them directly. The aim is to simplify video classification and reduce inference latency, improving sustainability in smart cities. Our experiments show that image classifiers can efficiently detect violence and distinguish weaponized from non-weaponized violence, with performance as high as 99\% on our dataset, which is comparable to current SOTA 3D networks for action recognition and video classification.
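The idea of arranging video frames into a single image for an ordinary image classifier can be illustrated with a minimal sketch. The abstract does not specify how frames are selected or laid out, so everything below is an assumption made for illustration: the function names `sample_indices` and `frames_to_grid` are hypothetical, frames are sampled at evenly spaced positions, and they are tiled in row-major order into a `rows` x `cols` grid.

```python
import numpy as np


def sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return num_samples evenly spaced frame indices in [0, num_frames - 1]."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int).tolist()


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile sampled frames of a clip into one image.

    frames: array of shape (T, H, W, C)
    returns: array of shape (rows * H, cols * W, C), tiles in row-major order
    """
    t, h, w, c = frames.shape
    # Assumption: evenly spaced sampling; the paper's selection rule may differ.
    picked = frames[sample_indices(t, rows * cols)]  # (rows * cols, H, W, C)
    # Split the tile axis into (rows, cols), then interleave with (H, W).
    grid = picked.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


if __name__ == "__main__":
    # Dummy 12-frame clip of 2x3 single-channel frames; frame i is filled with i.
    clip = np.stack([np.full((2, 3, 1), i, dtype=np.uint8) for i in range(12)])
    image = frames_to_grid(clip, rows=2, cols=3)
    print(image.shape)
```

The resulting array can be passed to any 2D image classifier in place of a video tensor, which is the trade the abstract describes: temporal context is kept as spatial layout, at the cost of lower per-frame resolution for a fixed input size.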