Recent advances in neural speech synthesis have enabled widespread applications but have also introduced new challenges, spurring interest in defences against misuse and abuse. In particular, source attribution of synthesized speech is valuable for forensics and intellectual property protection, yet prior work in this area is limited in scope. To address these gaps, this paper presents our findings on identifying the sources of synthesized speech. We investigate whether speech synthesis models leave fingerprints in the waveforms they generate, focusing on the acoustic model and the vocoder, and study how each component influences the fingerprint in the final waveform. Our experiments on the multi-speaker LibriTTS dataset yield two key findings: (1) vocoders and acoustic models impart distinct, model-specific fingerprints on the waveforms they generate, and (2) vocoder fingerprints are the more dominant of the two and may mask the fingerprints of the acoustic model. These findings strongly suggest that both the acoustic model and the vocoder carry model-specific fingerprints, highlighting their potential utility in source identification applications.