Component-level audio spoofing (Comp-Spoof) targets a new form of audio manipulation in which only specific components of a signal, such as speech or environmental sound, are forged or substituted while the other components remain genuine. Existing anti-spoofing datasets and methods treat an utterance or a segment as entirely bona fide or entirely spoofed, and thus cannot accurately detect component-level spoofing. To address this, we construct a new dataset, CompSpoof, covering multiple combinations of bona fide and spoofed speech and environmental sound. We further propose a separation-enhanced joint learning framework that separates the audio components and applies an anti-spoofing model to each one. Joint learning of the separation and detection modules preserves information relevant for detection. Extensive experiments demonstrate that our method outperforms the baseline, highlighting the necessity of separating components and of detecting spoofing for each component individually. The dataset and code are available at: https://github.com/XuepingZhang/CompSpoof.