With the rapid deployment of speech generation systems in open environments, verifiable source attribution and copyright accountability for audio content have become critical. Current research lacks a unified benchmark that systematically compares watermark injection methods under realistic distribution shifts. To address this gap, we build VoxWatermark by applying 10 watermarking methods (4 neural and 6 traditional) with unified injection and annotation to multilingual, multi-source corpora, and by introducing no-box, black-box, and white-box perturbations that simulate real recording and transmission conditions. Based on this benchmark, we propose AudioWMD as a robust baseline detector for large-scale, multi-method, cross-distribution settings. Experiments show that both injection-method diversity and distribution shift affect detection stability, and they validate the effectiveness and scalability of AudioWMD. The dataset and code are publicly available.