Audio anti-spoofing systems are typically formulated as binary classifiers distinguishing bona fide from spoofed speech. This formulation fails under layered generative processing, where benign transformations introduce distributional shifts that are misclassified as spoofing. We show that phonation-modifying voice conversion and speech restoration are treated as out-of-distribution despite preserving speaker authenticity. Using a multi-class setup that separates bona fide, converted, spoofed, and converted-spoofed speech, we analyse model behaviour through self-supervised learning (SSL) embeddings and acoustic correlates. These benign transformations induce a drift in the SSL space, compressing bona fide and spoofed speech together and reducing classifier separability. Reformulating anti-spoofing as a multi-class problem improves robustness to benign shifts while preserving spoof-detection performance, suggesting that binary systems model the distribution of raw speech rather than authenticity itself.