Recent advances in neural speech synthesis have found widespread application, but they have also introduced new challenges, spurring interest in defences against misuse and abuse. Notably, source attribution of synthesized speech has value in forensics and intellectual property protection, but prior work in this area is limited in scope. To address these gaps, this paper presents our findings on identifying the sources of synthesized speech. We investigate whether speech synthesis models leave fingerprints in the speech waveforms they generate, focusing on the acoustic model and the vocoder, and study how each component influences the fingerprint in the overall waveform. Our experiments on the multi-speaker LibriTTS dataset yield two key insights: (1) vocoders and acoustic models impart distinct, model-specific fingerprints on the waveforms they generate, and (2) vocoder fingerprints are the more dominant of the two and may mask the fingerprints of the acoustic model. These findings strongly suggest the existence of model-specific fingerprints for both the acoustic model and the vocoder, highlighting their potential utility in source identification applications.