Advances in neural speech synthesis have produced technology that not only approaches human naturalness but is also capable of instant voice cloning from little data, and that is highly accessible thanks to the availability of pre-trained models. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic speech detection has been related to the Automatic Speaker Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view of generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine but remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests demonstrate that collaborative training has little adverse effect on the perceptual quality of vocoded speech.