This paper describes the systems we submitted to the ASVspoof 5 Challenge Track 1: Speech Deepfake Detection - Open Condition, a stand-alone speech deepfake (bonafide vs. spoof) detection task. Large-scale self-supervised models have recently become standard in Automatic Speech Recognition (ASR) and other speech processing tasks. We therefore use a pre-trained WavLM as the front-end model and pool its representations with different back-end techniques. The complete framework is fine-tuned using only the training dataset of the challenge, as in the closed condition. In addition, we apply data augmentation by adding noise and reverberation using the MUSAN noise and RIR datasets. We also experiment with codec augmentations to improve the performance of our method. Finally, we use the Bosaris toolkit for score calibration and system fusion to obtain better Cllr scores. Our fused system achieves 0.0937 minDCF, 3.42% EER, 0.1927 Cllr, and 0.1375 actDCF.
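Because calibration is evaluated here through Cllr, it may help to recall how that metric is computed. The following is a minimal sketch of the standard log-likelihood-ratio cost, assuming the detection scores are natural-log likelihood ratios with bonafide as the target class. The function names `softplus` and `cllr` are illustrative only; this is not the Bosaris implementation and not necessarily the exact evaluation code used by the challenge.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)), in nats."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(bonafide_llrs, spoof_llrs) -> float:
    """Log-likelihood-ratio cost in bits.

    Assumption: scores are natural-log likelihood ratios,
    positive values favouring the bonafide hypothesis.
    """
    if not bonafide_llrs or not spoof_llrs:
        raise ValueError("both score lists must be non-empty")
    # Cost of bonafide trials: penalises low (negative) LLRs.
    bona_cost = sum(softplus(-s) for s in bonafide_llrs) / len(bonafide_llrs)
    # Cost of spoof trials: penalises high (positive) LLRs.
    spoof_cost = sum(softplus(s) for s in spoof_llrs) / len(spoof_llrs)
    # Average the two class costs and convert nats to bits.
    return 0.5 * (bona_cost + spoof_cost) / math.log(2.0)
```

Under this definition, a system that always outputs an LLR of 0 scores exactly 1 bit, so values well below 1 indicate scores that are both discriminative and reasonably calibrated, while confidently wrong scores push Cllr above 1.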