Various threats posed by progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches add watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these problems, we propose TraceableSpeech, a novel TTS model that directly generates watermarked speech, improving watermark imperceptibility and speech quality. Furthermore, we design frame-wise imprinting and extraction of watermarks, achieving higher robustness against resplicing attacks and temporal flexibility in operation. Experimental results show that TraceableSpeech outperforms strong baselines in which VALL-E or HiFi-Codec is individually combined with WavMark, in terms of watermark imperceptibility, speech quality, and resilience against resplicing attacks. It also applies to speech of various durations. The code is available at https://github.com/zjzser/TraceableSpeech