Most approaches to speaker anonymization extract fundamental frequency estimates, linguistic features, and a speaker embedding. The embedding is perturbed to obfuscate the speaker identity, and a vocoder then resynthesizes an anonymized speech waveform. Recent work has shown that x-vector transformations are difficult to control consistently: other sources of speaker information contained within the fundamental frequency and linguistic features are re-entangled upon vocoding, so anonymized speech signals still contain speaker information. We propose an approach based upon neural audio codecs (NACs), which are known to generate high-quality synthetic speech when combined with language models. NACs use quantized codes, which are known to effectively bottleneck speaker-related information. We demonstrate the potential of speaker anonymization systems based on NAC language modeling by applying the evaluation framework of the Voice Privacy Challenge 2022.
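As a rough illustration of why quantized codes can act as a bottleneck, the toy sketch below maps continuous frames to the index of the nearest entry in a small codebook. It is not the system proposed here: the codebook, the two-dimensional frames, and the function names (`quantize`, `encode`, `decode`, `bits_per_frame`) are all hypothetical, and a real NAC learns its codebooks from data. The point is only that small variations between inputs, which is where some speaker-specific detail could live, are discarded once each frame is reduced to a discrete code.

```python
import math

def quantize(frame, codebook):
    """Return the index of the codebook entry nearest to `frame`
    (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((x - c) ** 2 for x, c in zip(frame, codebook[k])),
    )

def encode(frames, codebook):
    """Map each continuous frame to a discrete code index."""
    return [quantize(frame, codebook) for frame in frames]

def decode(codes, codebook):
    """Map code indices back to their codebook vectors."""
    return [codebook[k] for k in codes]

def bits_per_frame(codebook):
    """Upper bound on the information one code index can carry."""
    return math.log2(len(codebook))

# Hypothetical 2-D codebook with four entries (an assumption for illustration).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Two made-up frame sequences that differ only by small offsets.
frames_a = [(0.1, 0.05), (0.9, 0.1), (0.05, 0.95)]
frames_b = [(-0.1, 0.1), (1.1, -0.05), (0.1, 1.1)]

codes_a = encode(frames_a, CODEBOOK)
codes_b = encode(frames_b, CODEBOOK)

print(codes_a, codes_b)          # both sequences collapse to the same codes
print(bits_per_frame(CODEBOOK))  # at most 2 bits per frame with 4 entries
```

In this sketch both sequences yield identical codes, so anything decoded from the codes alone cannot recover the offsets that distinguished them. How much speaker information survives quantization in practice depends on the codec and has to be measured, for example with the kind of evaluation framework mentioned above.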