The vast majority of approaches to speaker anonymization extract fundamental frequency estimates, linguistic features and a speaker embedding, perturb the embedding to obfuscate the speaker identity, and then resynthesize an anonymized speech waveform using a vocoder. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within the fundamental frequency and linguistic features are re-entangled upon vocoding, so anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information. We demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.