Advances in neural speech synthesis have produced technology that not only approaches human naturalness, but can also clone a voice instantly from little data and is highly accessible through available pre-trained models. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic speech detection has been tied to the Automatic Speaker Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view of generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine but remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests demonstrate that collaborative training has little adverse effect on the perceptual quality of vocoded speech.