Disentanglement-based speaker anonymization involves decomposing speech into a semantically meaningful representation, altering the speaker embedding, and resynthesizing a waveform using a neural vocoder. State-of-the-art systems of this kind are known to remove emotion information. Possible reasons include mode collapse in GAN-based vocoders, unintended modeling and modification of emotions through speaker embeddings, and excessive sanitization of the intermediate representation. In this paper, we conduct a comprehensive evaluation of a state-of-the-art speaker anonymization system to identify the underlying causes. We conclude that the main reason is the lack of emotion-related information in the intermediate representation. Speaker embeddings also have a high impact if they are learned in a generative context. The vocoder's out-of-distribution performance has a smaller impact. Additionally, we find that synthesis artifacts increase spectral kurtosis, which biases emotion recognition evaluation towards classifying utterances as angry. We therefore conclude that reporting unweighted average recall alone is insufficient for characterizing emotion recognition performance.
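To illustrate the final point of the abstract, the following is a minimal sketch of why unweighted average recall (UAR) alone can hide a bias towards one class. The two confusion matrices are hypothetical and constructed purely for illustration; they are not results from this paper, and the four emotion labels and the helper names are assumptions.

```python
import numpy as np

LABELS = ["angry", "happy", "neutral", "sad"]  # assumed label set


def per_class_recall(confusion):
    """Recall of each class; rows are true labels, columns are predictions."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)


def unweighted_average_recall(confusion):
    """UAR: the plain mean of the per-class recalls."""
    return float(per_class_recall(confusion).mean())


def prediction_share(confusion):
    """Fraction of all utterances that were assigned to each predicted class."""
    confusion = np.asarray(confusion, dtype=float)
    return confusion.sum(axis=0) / confusion.sum()


# Hypothetical classifier whose errors are spread evenly over the classes.
balanced = np.array([
    [60, 15, 15, 10],
    [15, 60, 15, 10],
    [10, 15, 60, 15],
    [10, 15, 15, 60],
])

# Hypothetical classifier that labels many non-angry utterances as angry.
angry_biased = np.array([
    [100, 0, 0, 0],
    [40, 50, 5, 5],
    [45, 5, 45, 5],
    [45, 5, 5, 45],
])

if __name__ == "__main__":
    for name, cm in [("balanced", balanced), ("angry_biased", angry_biased)]:
        print(name)
        print("  UAR:", round(unweighted_average_recall(cm), 3))
        print("  per-class recall:", dict(zip(LABELS, per_class_recall(cm).round(2))))
        print("  share predicted angry:", round(prediction_share(cm)[0], 3))
```

In this constructed example both matrices have the same UAR, while the share of utterances predicted as angry differs considerably. Reporting per-class recall or the full confusion matrix alongside UAR makes such a bias visible.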