This paper presents the DFKI-Speech system developed for the Spoofing-Aware Automatic Speaker Verification (SASV) track of the WildSpoof Challenge. We propose a robust SASV framework in which a spoofing detector and a speaker verification (SV) network operate in tandem. The spoofing detector uses a self-supervised speech embedding extractor as the frontend, combined with a state-of-the-art graph neural network backend. In addition, a top-3-layer-based mixture-of-experts (MoE) fuses high-level and low-level features for effective detection of spoofed utterances. For speaker verification, we adapt a low-complexity convolutional neural network that fuses 2D and 1D features at multiple scales and is trained with the SphereFace loss. A contrastive circle loss is also applied to adaptively weight positive and negative pairs within each training batch, enabling the network to better distinguish between hard and easy sample pairs. Finally, AS-Norm score normalization with a fixed impostor cohort and model ensembling are used to further enhance the discriminative capability of the speaker verification system.
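To make the score normalization step concrete, the following is a minimal sketch of adaptive symmetric normalization (AS-Norm) against a fixed impostor cohort. It is not the system's actual implementation: the function names `cosine_scores` and `as_norm`, the use of cosine similarity as the raw score, and the `top_k` parameter and its default value are all assumptions made for illustration, following the commonly used formulation in which each side of a trial is normalized by the statistics of its highest-scoring cohort entries.

```python
import numpy as np


def cosine_scores(x, cohort):
    """Cosine similarity between one embedding and each row of a cohort matrix."""
    x = x / np.linalg.norm(x)
    cohort = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    return cohort @ x


def as_norm(enroll, test, cohort, top_k=300):
    """Hypothetical AS-Norm: symmetric z-normalization of a trial score.

    enroll, test: 1-D speaker embeddings for the two sides of the trial.
    cohort: 2-D array of fixed impostor embeddings, one per row.
    top_k: assumed number of closest cohort entries used for the statistics.
    """
    # Raw trial score (cosine similarity is an assumption of this sketch).
    raw = float(cosine_scores(enroll, test[None, :])[0])

    # Adaptive part: keep only the top-k cohort scores for each side.
    k = min(top_k, cohort.shape[0])
    e_scores = np.sort(cosine_scores(enroll, cohort))[-k:]
    t_scores = np.sort(cosine_scores(test, cohort))[-k:]

    # Symmetric part: average the two z-normalized versions of the raw score.
    eps = 1e-8  # guards against a zero standard deviation
    z_e = (raw - e_scores.mean()) / (e_scores.std() + eps)
    z_t = (raw - t_scores.mean()) / (t_scores.std() + eps)
    return 0.5 * (z_e + z_t)
```

In this sketch the cohort matrix is computed once and reused for every trial, which is one reading of a "fixed" cohort; only the selection of the top-scoring entries varies per trial.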