The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network-based watermarking methods such as WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize robust detection and accurate attribution. This paper introduces Cross-Attention Robust Audio Watermark (XATTNMARK), which bridges this gap with three components: partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned time-frequency (TF) masking loss that captures fine-grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.