Most approaches to speaker anonymization extract fundamental frequency estimates, linguistic features, and a speaker embedding, perturb the embedding to obfuscate the speaker identity, and then resynthesize an anonymized speech waveform using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained in the fundamental frequency and linguistic features are re-entangled upon vocoding, so anonymized speech signals still contain speaker information. We propose an approach based on neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to bottleneck speaker-related information effectively. We demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.