Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, and cross-dataset generalization remains a major limitation. In this work, we propose a Temporal Pyramid Adapter that uses parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We integrate self-supervised XLS-R representations with front-end adapters, including Mel, Sinc, and the Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated across multiple benchmarks, including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and the multilingual HQ-MPSD dataset. Experimental results show that the Temporal Pyramid model achieves an AUC of 99.24% and an EER of 3.87% on the PartialSpoof database, significantly outperforming the base model and several state-of-the-art baselines such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that spoofing artifacts are independent of language. However, although self-supervised representations improve robustness, performance still degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.
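The idea of parallel temporal convolutions with varying receptive fields can be illustrated with a minimal PyTorch sketch. It is not the implementation evaluated here: the class layout, the use of dilation to vary the receptive field, the dilation rates, the branch width, the 1x1 fusion layer, the residual connection, and the default feature dimension of 1024 are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TemporalPyramidAdapter(nn.Module):
    """Illustrative adapter: parallel dilated 1-D convolutions over frame features.

    All hyperparameters below are hypothetical defaults, not values from the paper.
    """

    def __init__(self, in_dim=1024, branch_dim=64, kernel_size=3, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One branch per dilation rate; a larger dilation gives a wider receptive field.
        # With an odd kernel size, this padding keeps the number of frames unchanged.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(
                    in_dim,
                    branch_dim,
                    kernel_size,
                    padding=dilation * (kernel_size - 1) // 2,
                    dilation=dilation,
                ),
                nn.ReLU(),
            )
            for dilation in dilations
        ])
        # A 1x1 convolution fuses the concatenated branches back to the input width.
        self.fuse = nn.Conv1d(branch_dim * len(dilations), in_dim, kernel_size=1)

    def forward(self, feats):
        # feats: (batch, time, in_dim), e.g. frame-level self-supervised features.
        x = feats.transpose(1, 2)  # Conv1d expects (batch, channels, time)
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Residual connection (an assumption) keeps the original features available.
        return feats + self.fuse(multi_scale).transpose(1, 2)
```

In this sketch, the branch with dilation 1 covers only neighbouring frames (local artifacts), while the branch with dilation 8 spans a much longer context (closer to prosodic scale). The output keeps the input shape, so such a module could sit between a feature extractor and a classifier.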
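For reference, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal pure-Python sketch of that computation, assuming that higher scores indicate bona fide speech. The function name `compute_eer` and the threshold sweep over observed scores are illustrative choices, not the evaluation code used in this work.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming a higher score means 'more bona fide'."""
    # Candidate thresholds: every distinct score observed in either class.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores for illustration only (not data from any benchmark).
eer, threshold = compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On a finite score set the two error rates rarely cross exactly, so this sketch reports the mean of the two rates at the closest point. Standard toolkits typically interpolate along the ROC curve instead, which can give slightly different values.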