As policy catches up with the capabilities of generative AI, watermarking has become central to content provenance efforts. Inference-time watermarks for autoregressive models are ill-suited to continuous modalities because of discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, which forfeits the watermark's training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose a simple solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and mitigate them using a reduced vocabulary obtained via community detection. Experiments show that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. More broadly, we establish a new state of the art for token-level watermarks in multimedia, one that arises simply from the nature of discrete representation learning.
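To give a rough intuition for why token errors matter for detection, the sketch below uses a generic green-list z-test of the kind common in token-level watermarking. It is not the method of this work: the function names, the parameters `gamma`, `green_rate`, and `error_rate`, and the assumption that a mis-recovered token lands in the green list only at chance level are all illustrative.

```python
import math


def expected_z(n_tokens: int, gamma: float, green_rate: float, error_rate: float) -> float:
    """Expected z-score of a one-sided green-list test under token errors.

    Assumptions (hypothetical, for illustration only):
      - gamma is the fraction of the vocabulary in the green list;
      - green_rate is the fraction of green tokens in watermarked output;
      - a token recovered incorrectly at detection time is green with
        probability gamma, i.e. it carries no watermark signal.
    """
    # Green fraction seen by the detector after a share of tokens is corrupted.
    p_obs = (1.0 - error_rate) * green_rate + error_rate * gamma
    # Standard z-statistic for a binomial count, taken in expectation.
    return (p_obs - gamma) * math.sqrt(n_tokens) / math.sqrt(gamma * (1.0 - gamma))


def p_value(z: float) -> float:
    """One-sided p-value of a z-score under the standard normal null."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Sweep the token error rate and watch the detection signal shrink.
    for err in (0.0, 0.25, 0.5, 0.75):
        z = expected_z(n_tokens=400, gamma=0.5, green_rate=0.9, error_rate=err)
        print(f"error_rate={err:.2f}  z={z:.2f}  p={p_value(z):.3e}")
```

Under these assumptions the expected z-score shrinks linearly with the error rate, while the p-value changes far faster because it depends on the tail of the normal distribution. This is one way to see how lowering the effective token error rate could change detectability by orders of magnitude.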