The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

Automatic speaker verification is susceptible to various manipulations and spoofing attacks, such as text-to-speech synthesis, voice conversion, replay, tampering, and adversarial attacks. We consider a new spoofing scenario called "Partial Spoof" (PS), in which synthesized or transformed speech segments are embedded into a bona fide utterance. While existing countermeasures (CMs) can detect fully spoofed utterances, they need to be adapted or extended to the PS scenario. We propose several improvements to construct a significantly more accurate CM that can detect and locate short generated spoofed speech segments at finer temporal resolutions. First, we introduce newly developed self-supervised pre-trained models as enhanced feature extractors. Second, we extend our PartialSpoof database by adding segment labels for various temporal resolutions. Since the short spoofed speech segments embedded by attackers are of variable length, six different temporal resolutions are considered, ranging from as short as 20 ms to as long as 640 ms. Third, we propose a new CM that uses the segment-level labels at different temporal resolutions together with the utterance-level labels, so that utterance- and segment-level detection are performed at the same time. We also show that the proposed CM detects spoofing at the utterance level with low error rates in the PS scenario as well as in a related logical access (LA) scenario. The equal error rates of utterance-level detection on the PartialSpoof database and the ASVspoof 2019 LA database were 0.77% and 0.90%, respectively.
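To make the idea of segment labels at several temporal resolutions concrete, the following is a minimal sketch of how coarser labels could be derived from labels at the finest resolution. It is an illustration under stated assumptions, not the labeling procedure used for the database: the abstract gives only the two endpoints (20 ms and 640 ms) and the number of resolutions (six), so the doubling steps in between, the convention 1 = spoofed and 0 = bona fide, the rule that a segment counts as spoofed if any finer frame inside it is spoofed, and the function names are all assumptions made here.

```python
from typing import Dict, List

# Assumption: six resolutions obtained by doubling from 20 ms up to 640 ms.
# The abstract states only the endpoints and the count.
RESOLUTIONS_MS = [20, 40, 80, 160, 320, 640]


def downsample_labels(frame_labels: List[int], factor: int) -> List[int]:
    """Merge each run of `factor` base frames into one segment label.

    Assumed convention: 1 = spoofed, 0 = bona fide.
    Assumed rule: a segment is spoofed if any base frame inside it is spoofed.
    A trailing partial segment is kept and labeled by the same rule.
    """
    return [
        int(any(frame_labels[i:i + factor]))
        for i in range(0, len(frame_labels), factor)
    ]


def multi_resolution_labels(frame_labels: List[int],
                            base_ms: int = 20) -> Dict[int, List[int]]:
    """Return one label sequence per resolution, keyed by resolution in ms."""
    return {
        res: downsample_labels(frame_labels, res // base_ms)
        for res in RESOLUTIONS_MS
    }


def utterance_label(frame_labels: List[int]) -> int:
    """An utterance is spoofed if it contains any spoofed frame (assumed rule)."""
    return int(any(frame_labels))


if __name__ == "__main__":
    # Hypothetical 640 ms utterance: 32 frames of 20 ms, with a 40 ms
    # spoofed insert covering frames 8 and 9.
    frames = [0] * 8 + [1] * 2 + [0] * 22
    for res, seg in multi_resolution_labels(frames).items():
        print(f"{res:>3} ms: {seg}")
    print("utterance:", utterance_label(frames))
```

In this sketch a 40 ms insert occupies a single segment at every resolution, but that segment covers a growing share of bona fide audio as the resolution coarsens, which is one way to see why localization is evaluated at several resolutions rather than only one.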
