Speaker anonymization aims to conceal a speaker's identity while preserving the content information in speech. Current mainstream neural-network-based speaker anonymization systems disentangle speech into prosody-related, content, and speaker representations. The speaker representation is then anonymized by a selection-based speaker anonymizer, which takes the mean of a set of speaker vectors randomly selected from an external pool of English speakers. However, the resulting anonymized vectors suffer from severe privacy leakage against powerful attackers, reduced speaker diversity, and language mismatch when anonymizing speakers of unseen languages. To generate diverse, language-neutral speaker vectors, this paper proposes an anonymizer based on an orthogonal Householder neural network (OHNN). Specifically, the OHNN acts like a rotation that transforms the original speaker vectors into anonymized speaker vectors, which are constrained to follow the distribution of the original speaker vector space. A basic classification loss is introduced to ensure that anonymized speaker vectors from different speakers have unique speaker identities. To further protect speaker identities, an improved classification loss and a similarity loss are used to push each original-anonymized sample pair apart. Experiments on the VoicePrivacy Challenge datasets in English and the AISHELL-3 dataset in Mandarin demonstrate the effectiveness of the proposed anonymizer.
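To make the rotation-like behaviour concrete, the following is a minimal sketch of an orthogonal map built as a product of Householder reflections, H = I - 2vv^T / (v^T v). It is an illustration under stated assumptions, not the OHNN itself: the reflection vectors are random stand-ins for learned parameters, the dimension and number of reflections are arbitrary toy values, the function names are hypothetical, and none of the classification or similarity losses are modelled.

```python
import math
import random


def householder_reflect(x, v):
    """Apply H = I - 2 v v^T / (v^T v) to x without forming the matrix H."""
    vv = sum(vi * vi for vi in v)
    scale = 2.0 * sum(vi * xi for vi, xi in zip(v, x)) / vv
    return [xi - scale * vi for xi, vi in zip(x, v)]


def orthogonal_transform(x, reflectors):
    """Apply a product of Householder reflections, which is an orthogonal map."""
    for v in reflectors:
        x = householder_reflect(x, v)
    return x


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def cosine(a, b):
    return sum(ai * bi for ai, bi in zip(a, b)) / (norm(a) * norm(b))


rng = random.Random(0)
dim = 8              # toy speaker-vector size (assumption)
num_reflections = 4  # an even count gives a proper rotation (determinant +1)

# Random stand-ins for the reflection vectors a trained network would learn.
reflectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(num_reflections)]

# Two toy "original speaker vectors".
speaker_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
speaker_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

anon_a = orthogonal_transform(speaker_a, reflectors)
anon_b = orthogonal_transform(speaker_b, reflectors)

# Similarity of each original vector to its own anonymized counterpart.
self_similarity_a = cosine(speaker_a, anon_a)
```

Because every reflection is orthogonal, the transform preserves vector norms and the angles between different speakers' vectors, which is one way to see why the anonymized vectors can stay within the geometry of the original speaker vector space. How far each vector moves from its own original (`self_similarity_a` here) depends on the reflection vectors, which is what the losses described above would shape during training.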