In this paper, we focus on audio violence detection (AVD). AVD matters for maintaining safety, preventing harm, and ensuring security across a range of environments, which calls for accurate AVD systems. As in many related audio processing applications, the most common way to improve performance is to leverage self-supervised learning (SSL) pre-trained models (PTMs). However, these SSL models are very large, with millions of parameters, which can hinder real-world deployment, especially in compute-constrained environments. To address this, we propose using speaker recognition models, which are much smaller than the SSL models. Through experiments that pair speaker recognition model embeddings with SVM and Random Forest classifiers, we show that these embeddings outperform state-of-the-art (SOTA) SSL models and achieve SOTA results.
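The classification stage described above (fixed embeddings fed to SVM and Random Forest classifiers) can be sketched as follows. This is a minimal illustration, assuming scikit-learn and NumPy are available. The embeddings here are random vectors standing in for the output of a speaker recognition model, and the embedding size of 192, the class separation, and all hyperparameters are assumptions made for the example, not values taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data: in practice each row would be an utterance-level embedding
# extracted from a pre-trained speaker recognition model (extraction not shown).
rng = np.random.default_rng(0)
EMBEDDING_DIM = 192  # assumed embedding size, chosen only for illustration
N_PER_CLASS = 200

# Two synthetic, deliberately easy-to-separate classes.
non_violent = rng.normal(loc=0.0, scale=1.0, size=(N_PER_CLASS, EMBEDDING_DIM))
violent = rng.normal(loc=1.0, scale=1.0, size=(N_PER_CLASS, EMBEDDING_DIM))
X = np.vstack([non_violent, violent])
y = np.array([0] * N_PER_CLASS + [1] * N_PER_CLASS)  # 0 = non-violent, 1 = violent

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The two classifier families named in the abstract; hyperparameters are
# illustrative defaults, not the paper's settings.
classifiers = {
    "svm": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on the held-out split
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Because the data are synthetic, the printed accuracies say nothing about real AVD performance. The sketch only shows the shape of the pipeline: the upstream model stays frozen, and only a lightweight classifier is trained on its embeddings.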