In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss, which enables watermark detection down to the sample level, and a novel perceptual loss inspired by auditory masking, which improves imperceptibility. AudioSeal achieves state-of-the-art performance in both robustness to real-life audio manipulations and imperceptibility, as measured by automatic and human evaluation metrics. Additionally, AudioSeal is designed with a fast, single-pass detector that significantly surpasses existing models in speed, achieving detection up to two orders of magnitude faster and making it well suited to large-scale and real-time applications.
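To make "localized detection at the sample level" concrete, the following is a minimal sketch of how per-sample detector outputs could be turned into time spans. It is an illustration only and not AudioSeal's actual API: the function name `watermarked_segments`, the 0.5 threshold, and the `min_len` filter are all assumptions introduced here, and the input is assumed to be one watermark probability per audio sample.

```python
from typing import List, Sequence, Tuple


def watermarked_segments(
    probs: Sequence[float],
    sample_rate: int,
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
    min_len: int = 1,        # assumed minimum run length, in samples
) -> List[Tuple[float, float]]:
    """Group consecutive samples whose probability exceeds `threshold`
    into (start_seconds, end_seconds) spans. End times are exclusive."""
    segments: List[Tuple[float, float]] = []
    start = None  # index where the current above-threshold run began

    for i, p in enumerate(probs):
        if p > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at sample i; keep it only if it is long enough.
            if i - start >= min_len:
                segments.append((start / sample_rate, i / sample_rate))
            start = None

    # Close a run that extends to the end of the signal.
    if start is not None and len(probs) - start >= min_len:
        segments.append((start / sample_rate, len(probs) / sample_rate))

    return segments
```

For example, calling `watermarked_segments(probs, sample_rate=16000)` on a hypothetical probability array would return the time ranges flagged as watermarked. Raising `min_len` is one simple way to suppress isolated false positives, at the cost of missing very short watermarked fragments.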