Front-end design for speech deepfake detectors has centered on two categories of features. Hand-crafted filterbank features are transparent but limited in their ability to capture high-level semantic details, which often leaves a performance gap relative to self-supervised learning (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the strengths of both approaches via the wavelet scattering transform (WST), which cascades wavelet filters with nonlinearities in a manner analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ($J$), combined with high frequency resolution ($Q$) and high directional resolution ($L$), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.
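To make the roles of $J$ and $Q$ concrete, the following is a minimal NumPy sketch of a first-order 1D scattering layer, $S_1 x = |x \ast \psi_\lambda| \ast \phi_J$, subsampled by $2^J$. It is an illustration under stated assumptions, not the WST-X implementation: the function name `scattering_1d_order1`, the Gaussian (Gabor-like) filters, the choice of $J \cdot Q$ band-pass filters, and the constants `0.4` and `0.1` are all hypothetical choices made for this example. Higher-order paths and the 2D transform (where $L$ sets the number of orientations) are omitted.

```python
import numpy as np


def scattering_1d_order1(x, J, Q):
    """First-order scattering sketch: |x * psi| * phi, subsampled by 2**J.

    J: log2 of the averaging scale (smaller J keeps finer temporal detail).
    Q: number of band-pass filters per octave (frequency resolution).
    All filter constants below are illustrative assumptions.
    """
    n = x.shape[-1]
    freqs = np.fft.fftfreq(n)  # normalised frequency in cycles/sample

    # Gaussian low-pass phi_J; its bandwidth shrinks as the averaging scale 2**J grows.
    sigma_low = 0.1 / 2 ** J
    phi_hat = np.exp(-0.5 * (freqs / sigma_low) ** 2)

    # Q filters per octave over J octaves, with geometrically spaced centre frequencies.
    xi = 0.4 * 2.0 ** (-np.arange(J * Q) / Q)
    r = 2.0 ** (1.0 / Q)
    # Constant-Q bandwidths, floored so no filter is narrower than the low-pass.
    sigma = np.maximum(xi * (r - 1.0) / (r + 1.0), sigma_low)
    # Filters live on positive frequencies only, so the filtered signal is complex.
    psi_hat = np.exp(-0.5 * ((freqs[None, :] - xi[:, None]) / sigma[:, None]) ** 2)

    x_hat = np.fft.fft(x)
    # Wavelet modulus: band-pass filtering followed by the modulus nonlinearity.
    u1 = np.abs(np.fft.ifft(x_hat[None, :] * psi_hat, axis=-1))
    # Local averaging with phi_J gives invariance to shifts well below 2**J samples.
    s1 = np.fft.ifft(np.fft.fft(u1, axis=-1) * phi_hat[None, :], axis=-1).real
    return s1[:, :: 2 ** J], xi


# Usage: compare the coefficients of a signal and a copy shifted by 4 samples.
rng = np.random.default_rng(0)
x = rng.standard_normal(2 ** 13)
s1, xi = scattering_1d_order1(x, J=6, Q=8)
s1_shift, _ = scattering_1d_order1(np.roll(x, 4), J=6, Q=8)
rel = np.linalg.norm(s1 - s1_shift) / np.linalg.norm(s1)
print(s1.shape, rel)
```

In this sketch, lowering `J` shortens the averaging window and yields more time frames per channel, while raising `Q` packs more filters into each octave. That is the kind of trade-off the analysis of $J$ and $Q$ refers to, although the configurations actually studied may differ from these toy settings.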