Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel under in-domain conditions, they often generalise poorly to out-of-domain settings. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show that our model achieves a 25.7% relative reduction in equal error rate (EER) compared to the MHFA baseline.
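The two mechanisms named in the abstract, the gradient reversal layer and the Variational Information Bottleneck, can be illustrated with a minimal sketch. It is not the implementation used in this work: the names `GradientReversal`, `vib_kl`, and `reparameterize`, the parameter `lam`, and the choice of a standard-normal prior with a diagonal Gaussian posterior are all assumptions made for illustration, and the backward pass is written by hand rather than with an autograd library.

```python
import math
import random


class GradientReversal:
    """Hypothetical gradient reversal layer (hand-written backward pass).

    Forward: identity. Backward: gradients are multiplied by -lam, so an
    encoder placed before this layer is pushed to *increase* the loss of
    the speaker branch placed after it.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (assumed hyperparameter)

    def forward(self, x):
        return list(x)

    def backward(self, grad_out):
        return [-self.lam * g for g in grad_out]


def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    This is the usual closed-form bottleneck penalty for a diagonal
    Gaussian posterior and a standard-normal prior (an assumption here).
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


# Usage: the speaker-branch gradient is flipped before reaching the encoder.
grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
reversed_grad = grl.backward([1.0, -2.0, 4.0])

# Usage: sample a bottleneck code and compute its KL penalty.
rng = random.Random(0)
mu, log_var = [0.0, 1.0], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)
penalty = vib_kl(mu, log_var)
```

In a setup of this kind, the KL penalty would typically be added to the spoofing loss with a small weight, and that weight would control how much information the bottleneck lets through; how the terms are weighted in the proposed framework is not specified in the abstract.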