In this paper, we propose a deep neural network approach for deepfake speech detection (DSD) based on a low-complexity Depthwise-Inception Network (DIN) trained with a contrastive training strategy (CTS). In this framework, input audio recordings are first transformed into spectrograms using the Short-Time Fourier Transform (STFT) and a Linear Filter (LF), which are then used to train the DIN. Once trained, the DIN processes bonafide utterances to extract audio embeddings, from which a Gaussian distribution representing genuine speech is constructed. Deepfake detection is then performed by computing the distance between a test utterance's embedding and this distribution to decide whether the utterance is fake or bonafide. To evaluate the proposed system, we conducted extensive experiments on the benchmark ASVspoof 2019 LA dataset. The experimental results demonstrate the effectiveness of combining the Depthwise-Inception Network with the contrastive learning strategy in distinguishing between fake and bonafide utterances. We achieved Equal Error Rate (EER), Accuracy (Acc.), F1, and AUC scores of 4.6%, 95.4%, 97.3%, and 98.9%, respectively, using a single, low-complexity DIN with just 1.77 M parameters and 985 M FLOPs on short audio segments (4 seconds). Furthermore, our proposed system outperforms the single-system submissions to the ASVspoof 2019 LA challenge, showcasing its potential for real-time applications.
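As a minimal sketch of the scoring step described above, the snippet below fits a Gaussian to bonafide DIN embeddings and scores a test embedding by its distance to that distribution. The abstract does not fix the exact distance metric, so Mahalanobis distance is an assumption here, and the `din.embed` call and helper names are hypothetical placeholders rather than the paper's actual API.

```python
import numpy as np

def fit_bonafide_gaussian(embeddings: np.ndarray):
    """Fit a multivariate Gaussian to bonafide DIN embeddings (shape: N x D)."""
    mu = embeddings.mean(axis=0)
    # Slight regularization keeps the covariance invertible for small N.
    cov = np.cov(embeddings, rowvar=False) + 1e-6 * np.eye(embeddings.shape[1])
    return mu, np.linalg.inv(cov)

def spoof_score(embedding: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance to the bonafide distribution (assumed metric);
    larger scores indicate the utterance is more likely a deepfake."""
    diff = embedding - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Hypothetical usage: scores above a threshold (e.g., tuned for EER on a
# development set) are flagged as fake.
# bona_emb = din.embed(bonafide_utterances)   # din.embed is a placeholder
# mu, cov_inv = fit_bonafide_gaussian(bona_emb)
# is_fake = spoof_score(din.embed(test_utt), mu, cov_inv) > threshold
```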