Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about deepfake misuse. Speech watermarking mitigates this risk by embedding traceable information into generated speech. Mainstream watermarking methods operate at the signal level (waveform or spectrogram), which leaves the watermark vulnerable to generative attacks (e.g., neural codecs and vocoders). To address this, we propose DuraMark, a robust information-level watermarking framework that embeds the watermark by editing syllable durations. Specifically, DuraMark integrates a duration-controllable LLM-based TTS model, which edits syllable durations during synthesis, with a duration extractor, which recovers these durations for detection. Experiments demonstrate DuraMark's superior robustness against generative attacks, significantly outperforming signal-level baselines. Audio samples are available at https://muzw.github.io/duramark_demo/.
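The abstract does not specify how bits are mapped onto syllable durations, so the following is only a toy sketch of the general idea of carrying information in durations. It assumes a parity-based quantization scheme (a form of quantization index modulation) that is swapped in here for illustration and is not claimed to be DuraMark's method. The names `STEP_MS`, `embed_bits`, and `extract_bits` are hypothetical, as is the 40 ms step size; in a real system the edited durations would be passed to the duration-controllable TTS model, and the durations at detection time would come from the duration extractor.

```python
# Toy illustration only: parity-based quantization of syllable durations.
# This is an assumed scheme, NOT the encoding used by DuraMark.

STEP_MS = 40.0  # hypothetical quantization step, in milliseconds


def embed_bits(durations_ms, bits, step=STEP_MS):
    """Snap each duration to the nearest multiple of `step` whose
    parity (even or odd multiple) equals the bit to embed."""
    marked = []
    for d, b in zip(durations_ms, bits):
        # Nearest k such that the target duration is (2k + b) * step.
        k = round((d - b * step) / (2 * step))
        # Keep the resulting duration strictly positive.
        k = max(k, 1 - b)
        marked.append((2 * k + b) * step)
    return marked


def extract_bits(durations_ms, step=STEP_MS):
    """Recover each bit as the parity of the nearest multiple of `step`."""
    return [int(round(d / step)) % 2 for d in durations_ms]


if __name__ == "__main__":
    original = [180.0, 95.0, 230.0, 150.0]  # made-up syllable durations (ms)
    payload = [1, 0, 1, 1]
    marked = embed_bits(original, payload)
    print(marked)
    print(extract_bits(marked))
```

In this sketch a bit survives any distortion that shifts a syllable's measured duration by less than half a step. A larger step therefore trades naturalness of the edited timing for tolerance to duration estimation error.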