Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker's voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques.