The rise of AI-driven generative models has enabled the creation of highly realistic speech deepfakes, synthetic audio signals that can imitate target speakers' voices, raising critical security concerns. Existing methods for detecting speech deepfakes primarily rely on supervised learning, which suffers from two critical limitations: limited generalization to unseen synthesis techniques and a lack of explainability. In this paper, we address these issues by introducing a novel interpretable one-class detection framework, which reframes speech deepfake detection as an anomaly detection task. Our model is trained exclusively on real speech to characterize its distribution, enabling the classification of out-of-distribution samples as synthetically generated. Additionally, our framework produces interpretable anomaly maps during inference, highlighting anomalous regions across both the time and frequency domains. We achieve this through a Student-Teacher Feature Pyramid Matching system, enhanced with Discrepancy Scaling to improve generalization across unseen data distributions. Extensive evaluations demonstrate the superior performance of our approach compared to the considered baselines, validating the effectiveness of framing speech deepfake detection as an anomaly detection problem.
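To make the one-class idea concrete, the following is a minimal, hypothetical sketch of student-teacher feature matching with discrepancy scaling, not the paper's actual architecture. A frozen "teacher" network maps input frames to features; a smaller "student" is trained only on real data to mimic it. At inference, the per-frame teacher-student discrepancy, standardized by statistics computed on held-out real data (the scaling step), serves as an anomaly map: near zero on real-like inputs, large on out-of-distribution ones. All shapes, the toy data, and the two-layer teacher are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
F, H, D = 8, 16, 4  # input dim, teacher hidden dim, feature dim (toy choices)

# Frozen teacher: two layers, so the one-layer student cannot match it globally,
# only on the real-data region it is trained on.
W1 = rng.normal(size=(F, H)) / np.sqrt(F)
W2 = rng.normal(size=(H, D)) / np.sqrt(H)

def teacher(x):
    return np.tanh(np.tanh(x @ W1) @ W2)

def student(x, V):
    return np.tanh(x @ V)

def train_student(real_samples, lr=0.1, epochs=300):
    # One-class training: the student only ever sees real samples.
    V = rng.normal(scale=0.1, size=(F, D))
    for _ in range(epochs):
        for x in real_samples:
            s, t = student(x, V), teacher(x)
            V -= lr * (x.T @ ((s - t) * (1.0 - s**2))) / x.shape[0]
    return V

def discrepancy(x, V):
    # Per-frame squared feature mismatch between teacher and student.
    return np.sum((teacher(x) - student(x, V)) ** 2, axis=-1)

# Toy stand-ins for spectrogram frames: "real" data near one region of
# feature space, "fake" data shifted away from it.
real_train = [rng.normal(scale=0.5, size=(16, F)) for _ in range(30)]
real_val   = [rng.normal(scale=0.5, size=(16, F)) for _ in range(20)]
fake       = [rng.normal(loc=1.5, scale=0.5, size=(16, F)) for _ in range(20)]

V = train_student(real_train)

# Discrepancy scaling: standardize scores using held-out real statistics.
d_val = np.concatenate([discrepancy(x, V) for x in real_val])
mu, sigma = d_val.mean(), d_val.std() + 1e-8

def anomaly_map(x, V):
    return (discrepancy(x, V) - mu) / sigma  # > 0: more anomalous than real

real_score = np.mean([anomaly_map(x, V).mean() for x in real_val])
fake_score = np.mean([anomaly_map(x, V).mean() for x in fake])
```

In this sketch the anomaly map is one score per frame; the paper's system extends the same principle to a feature pyramid, yielding maps over both time and frequency.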