Recent advances in text-to-speech technology make it possible to generate high-fidelity synthetic speech that is nearly indistinguishable from real human voices. Although recent studies demonstrate the efficacy of speech encoders based on self-supervised learning for deepfake detection, these models struggle to generalize to unseen speakers. Our quantitative analysis suggests that the encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. SNAP estimates a speaker subspace and applies an orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts in the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
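The speaker-nulling step can be illustrated with a minimal NumPy sketch. The abstract does not specify how the speaker subspace is estimated, so this sketch assumes one simple choice: take the top principal directions of the per-speaker mean embeddings as an orthonormal basis U, then apply the orthogonal projection I - U Uᵀ to each feature vector. The function names `estimate_speaker_subspace` and `null_speaker`, the `rank` parameter, and the synthetic data are all hypothetical and are not taken from SNAP itself.

```python
import numpy as np


def estimate_speaker_subspace(features, speaker_ids, rank):
    """Return an orthonormal basis of shape (dim, rank).

    Assumption for illustration: the speaker subspace is spanned by the
    top principal directions of the per-speaker mean embeddings.
    """
    features = np.asarray(features, dtype=np.float64)
    speaker_ids = np.asarray(speaker_ids)
    speakers = np.unique(speaker_ids)
    # One centroid per speaker, then remove the global mean of the centroids.
    centroids = np.stack(
        [features[speaker_ids == s].mean(axis=0) for s in speakers]
    )
    centered = centroids - centroids.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    if rank > vt.shape[0]:
        raise ValueError("rank exceeds the number of available directions")
    return vt[:rank].T


def null_speaker(features, basis):
    """Project features onto the orthogonal complement of span(basis)."""
    features = np.asarray(features, dtype=np.float64)
    # Equivalent to features @ (I - U U^T), without forming the full matrix.
    return features - (features @ basis) @ basis.T


# Synthetic stand-in for encoder features: 4 speakers, 50 utterances each.
rng = np.random.default_rng(0)
dim, n_speakers, n_utts = 16, 4, 50
speaker_offsets = rng.normal(scale=5.0, size=(n_speakers, dim))
speaker_ids = np.repeat(np.arange(n_speakers), n_utts)
features = speaker_offsets[speaker_ids] + rng.normal(size=(speaker_ids.size, dim))

basis = estimate_speaker_subspace(features, speaker_ids, rank=3)
residual = null_speaker(features, basis)  # would be fed to the detector
```

Under these assumptions, `residual` has no component along the estimated speaker directions, and a downstream detector would be trained on it instead of on the raw features. In practice the basis would be estimated on training speakers only and reused unchanged at test time.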