State-of-the-art audio deepfake detectors based on deep neural networks achieve impressive recognition performance. This advantage, however, comes with a significant carbon footprint, mainly because of the use of high-performance computing with accelerators and long training times. Studies estimate that an average deep NLP model produces around 626k lbs of CO\textsubscript{2}, about five times the lifetime emissions of an average US car. This poses a serious threat to the environment. To tackle this challenge, this study presents a novel framework for audio deepfake detection that can be trained using standard CPU resources alone. Our proposed framework uses off-the-shelf self-supervised learning (SSL) models that are pre-trained and available in public repositories. In contrast to existing methods, which fine-tune SSL models and employ additional deep neural networks for downstream tasks, we keep the pre-trained model fixed, extract SSL embeddings from it, and train classical machine learning algorithms such as logistic regression and shallow neural networks on those embeddings. Our approach achieves results competitive with commonly used high-carbon-footprint approaches. In experiments on the ASVspoof 2019 LA dataset, we achieve a 0.90\% equal error rate (EER) with fewer than 1k trainable model parameters. To encourage further research in this direction and support reproducible results, the Python code will be made publicly accessible following acceptance\footnote{\href{https://github.com/sahasubhajit/Speech-Spoofing-}{GitHub link}}.
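To make the pipeline concrete, the following is a minimal sketch of the general idea: a logistic regression classifier trained on fixed embeddings and scored by EER. It is not the authors' released code. Several things are assumptions for illustration only: the embeddings are synthetic Gaussian vectors standing in for utterance-level SSL embeddings, the embedding dimension of 768 is assumed (the abstract does not name the SSL model), and the function names `train_logistic_regression` and `equal_error_rate` are hypothetical. The EER printed here describes only the toy data, not ASVspoof 2019 LA.

```python
import numpy as np


def train_logistic_regression(X, y, lr=0.1, epochs=300):
    """Fit weights w and bias b by full-batch gradient descent on the log loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        z = np.clip(X @ w + b, -30.0, 30.0)  # clip to keep exp() well-behaved
        p = 1.0 / (1.0 + np.exp(-z))         # predicted P(bona fide)
        grad = p - y                          # derivative of log loss w.r.t. z
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def equal_error_rate(scores, labels):
    """Approximate EER: the point where false-acceptance and false-rejection
    rates are closest. labels: 1 = bona fide, 0 = spoof; a higher score
    means 'more likely bona fide'."""
    thresholds = np.unique(scores)  # sorted unique candidate thresholds
    bona = scores[labels == 1]
    spoof = scores[labels == 0]
    far = np.array([(spoof >= t).mean() for t in thresholds])  # spoof accepted
    frr = np.array([(bona < t).mean() for t in thresholds])    # bona fide rejected
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0


# --- Synthetic stand-in for SSL embeddings (assumption, not real data) ---
rng = np.random.default_rng(0)
EMBED_DIM = 768   # assumed embedding size; the abstract does not specify it
N_PER_CLASS = 400


def make_split():
    """Draw one toy split with a small mean shift between the two classes."""
    bona = rng.normal(loc=0.1, scale=1.0, size=(N_PER_CLASS, EMBED_DIM))
    spoof = rng.normal(loc=-0.1, scale=1.0, size=(N_PER_CLASS, EMBED_DIM))
    X = np.vstack([bona, spoof])
    y = np.concatenate([np.ones(N_PER_CLASS), np.zeros(N_PER_CLASS)])
    return X, y


X_train, y_train = make_split()
X_test, y_test = make_split()

w, b = train_logistic_regression(X_train, y_train)
test_scores = X_test @ w + b  # raw logits work as detection scores
eer = equal_error_rate(test_scores, y_test)

n_params = w.size + 1  # one weight per embedding dimension plus the bias
print(f"trainable parameters: {n_params}")
print(f"toy EER: {100 * eer:.2f}%")
```

Under the assumed 768-dimensional embedding, logistic regression has 769 trainable parameters, which is consistent with the "fewer than 1k" figure above. In a real setup the synthetic arrays would be replaced by embeddings extracted once from a frozen pre-trained SSL model, so only the small classifier is trained on the CPU.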