Speech anonymisation prevents misuse of spoken data by removing personal identifiers while preserving at least the linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on emotion embeddings and on acoustic features correlated with emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. The improvement holds across diverse datasets, emotions, target speakers, and inter-group conversions, without compromising intelligibility or anonymisation.
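To illustrate how an embedding-based emotion loss and a classifier loss might be combined when only part of the data carries emotion labels, here is a minimal sketch. It is not the formulation used in this work: the cosine distance, the softmax cross-entropy, the function names, and the use of `None` to mark unlabelled utterances are all assumptions made for illustration. The acoustic-feature loss is omitted.

```python
import math
from typing import Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def cross_entropy(logits: Vector, label: int) -> float:
    """Softmax cross-entropy for one utterance, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def emotion_losses(
    src_emb: Sequence[Vector],
    conv_emb: Sequence[Vector],
    conv_logits: Sequence[Vector],
    labels: Sequence[Optional[int]],
) -> Tuple[float, float]:
    """Return (embedding_loss, classifier_loss) averaged over a batch.

    Assumption: labels[i] is None for utterances without an emotion label.
    Such utterances contribute to the embedding loss only, which is what
    makes the setup semi-supervised in this sketch.
    """
    # Embedding loss needs no labels: compare source and converted speech.
    emb_loss = sum(
        cosine_distance(s, c) for s, c in zip(src_emb, conv_emb)
    ) / len(src_emb)

    # Classifier loss is computed on the labelled subset only.
    labelled = [(lg, lb) for lg, lb in zip(conv_logits, labels) if lb is not None]
    if labelled:
        cls_loss = sum(cross_entropy(lg, lb) for lg, lb in labelled) / len(labelled)
    else:
        cls_loss = 0.0
    return emb_loss, cls_loss
```

In a training loop, the two returned terms would typically be weighted and added to the generator objective. The weights are hyperparameters that this sketch does not specify.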