TriniMark: A Robust Generative Speech Watermarking Method for Trinity-Level Traceability

Diffusion-based speech generation has achieved remarkable fidelity, increasing the risk of misuse and unauthorized redistribution. However, most existing generative speech watermarking methods are developed for GAN-based pipelines, and watermarking for diffusion-based speech generation remains comparatively underexplored. In addition, prior work often focuses on content-level provenance, while support for model-level and user-level attribution is less mature. We propose \textbf{TriniMark}, a diffusion-based generative speech watermarking framework that targets trinity-level traceability, i.e., the ability to associate a generated speech sample with (i) the embedded watermark message (content-level provenance), (ii) the source generative model (model-level attribution), and (iii) the end user who requested generation (user-level traceability). TriniMark uses a lightweight encoder to embed watermark bits into time-domain speech features and reconstruct the waveform, and a temporal-aware gated convolutional decoder for reliable bit recovery. We further introduce a waveform-guided fine-tuning strategy to transfer watermarking capability into a diffusion model. Finally, we incorporate variable-watermark training so that a single trained model can embed different watermark messages at inference time, enabling scalable user-level traceability. Experiments on speech datasets indicate that TriniMark maintains speech quality while improving robustness to common single and compound signal-processing attacks, and it supports high-capacity watermarking for large-scale traceability.
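As a rough illustration of the user-level traceability described above, the sketch below maps a user ID to a fixed-length watermark message and attributes a decoded message back to a registered user by bit accuracy. It does not reproduce the TriniMark encoder, decoder, or diffusion model. The function names, the 32-bit message length, and the 0.9 acceptance threshold are hypothetical choices made for this example and are not taken from the paper.

```python
from typing import Dict, List, Optional


def user_id_to_bits(user_id: int, n_bits: int = 32) -> List[int]:
    """Encode a non-negative user ID as an MSB-first bit list (n_bits is an assumed length)."""
    if not 0 <= user_id < 2 ** n_bits:
        raise ValueError("user_id does not fit in n_bits")
    return [(user_id >> i) & 1 for i in reversed(range(n_bits))]


def bit_accuracy(decoded: List[int], reference: List[int]) -> float:
    """Fraction of positions where the decoded bits match the reference bits."""
    if len(decoded) != len(reference):
        raise ValueError("bit strings must have equal length")
    return sum(d == r for d, r in zip(decoded, reference)) / len(reference)


def attribute_user(
    decoded: List[int],
    registry: Dict[int, List[int]],
    threshold: float = 0.9,  # assumed acceptance threshold, not from the paper
) -> Optional[int]:
    """Return the registered user whose message best matches, or None if below threshold."""
    best_id: Optional[int] = None
    best_acc = 0.0
    for uid, bits in registry.items():
        acc = bit_accuracy(decoded, bits)
        if acc > best_acc:
            best_id, best_acc = uid, acc
    return best_id if best_acc >= threshold else None


# Hypothetical registry of users who requested generation.
registry = {uid: user_id_to_bits(uid) for uid in (7, 42, 1000)}

# Simulate a decoder output for user 42 with one bit corrupted by an attack.
decoded = user_id_to_bits(42)
decoded[0] ^= 1
matched_user = attribute_user(decoded, registry)
```

In this sketch a single model would serve many users because only the message changes per request, which is the role variable-watermark training plays in the abstract. A deployed system would likely add error-correcting codes on top of the raw ID bits.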
