Amid the rapid development of generative models such as diffusion models, distinguishing synthesized audio from its natural counterpart is becoming increasingly difficult. Deepfake detection offers a viable countermeasure, yet this defensive strategy inadvertently drives the continued refinement of generative models. Watermarking, by contrast, is a proactive and sustainable tactic that preemptively regulates the creation and dissemination of synthesized content. This paper therefore proposes Groot, a pioneering generative robust audio watermarking method that establishes a paradigm for proactively supervising synthesized audio and its source diffusion models. In this paradigm, watermark generation and audio synthesis proceed simultaneously, carried out by parameter-fixed diffusion models equipped with a dedicated encoder; the watermark embedded in the audio can subsequently be retrieved by a lightweight decoder. Experimental results highlight Groot's outstanding performance, particularly its robustness, which surpasses that of leading state-of-the-art methods. Beyond its strong resilience against individual post-processing attacks, Groot remains exceptionally robust under compound attacks, maintaining an average watermark extraction accuracy of around 95%.