Modern front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models such as XLSR. However, this approach is not parameter-efficient and may generalize poorly to realistic, in-the-wild data. To address these limitations, we introduce a new family of parameter-efficient front-ends that fuse prompt-tuning with classical signal processing transforms. These include FourierPT-XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT-XLSR and Partial-WSPT-XLSR. We further propose WaveSP-Net, a novel architecture combining a Partial-WSPT-XLSR front-end and a bidirectional Mamba-based back-end. This design injects multi-resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP-Net outperforms several state-of-the-art models on two new and challenging benchmarks, Deepfake-Eval-2024 and SpoofCeleb, achieving notable performance gains with few trainable parameters. The code and models are available at https://github.com/xxuan-acoustics/WaveSP-Net.
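To make the core idea concrete, the sketch below illustrates one way to inject multi-resolution wavelet features into prompt embeddings while the backbone stays frozen. It uses a single-level Haar decomposition along the prompt's time axis and adds the upsampled high-frequency band back as a residual. This is a minimal illustration of the general technique, not the paper's exact WSPT module; the `alpha` blending factor and the residual scheme are illustrative assumptions.

```python
import numpy as np

def haar_dwt(x):
    """One-level orthonormal Haar DWT along the first (time) axis.

    x: array of shape (T, D) with T even.
    Returns (approx, detail), each of shape (T/2, D).
    """
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency band
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency band
    return approx, detail

def wavelet_prompt(prompt, alpha=0.5):
    """Illustrative wavelet-based prompt injection (not the paper's exact module).

    prompt: (T, D) learnable prompt embeddings, T even.
    The high-frequency band, where fine-scale artifacts concentrate,
    is upsampled back to length T and added as a residual.
    """
    _, detail = haar_dwt(prompt)
    # Nearest-neighbour upsampling back to the original length.
    detail_up = np.repeat(detail, 2, axis=0) / np.sqrt(2.0)
    return prompt + alpha * detail_up

rng = np.random.default_rng(0)
prompt = rng.standard_normal((16, 8))   # 16 prompt tokens, dim 8
out = wavelet_prompt(prompt)
print(out.shape)  # (16, 8)
```

Because the transform touches only the prompt embeddings, all pre-trained XLSR weights can remain frozen; only the prompts (and any projection around them) would be trained.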