Speaker anonymization aims to conceal a speaker's identity, without considering the linguistic content. In this study, we reveal a weakness of LibriSpeech, the dataset commonly used to evaluate anonymizers: the books read by the LibriSpeech speakers are so distinct that speakers can be identified by their vocabularies alone. Even a perfect anonymizer cannot prevent this identity leakage. The EdAcc dataset is better in this regard: only a few speakers can be identified through their vocabularies, which encourages the attacker to look elsewhere for the identities of the anonymized speakers. EdAcc also comprises spontaneous speech and more diverse speakers, complementing LibriSpeech and offering further insight into how anonymizers work.
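To illustrate how vocabulary alone can leak identity, the following is a minimal sketch of a text-only attacker. It is not the attacker used in this study. It assumes the attacker has enrollment transcripts for each known speaker and a transcript of the anonymized trial utterance, and it matches them by cosine similarity over TF-IDF weighted word counts. All names (`tfidf_vectors`, `identify_speaker`, `spk_a`, `spk_b`) and the toy transcripts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn {name: token list} into {name: {word: tf-idf weight}} plus the idf table."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    # Smoothed idf: words shared by every speaker get weight 1, rarer words more.
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {w: c * idf[w] for w, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify_speaker(trial_text, enrollment_texts):
    """Return the enrolled speaker whose vocabulary best matches the trial transcript."""
    docs = {spk: text.lower().split() for spk, text in enrollment_texts.items()}
    vectors, idf = tfidf_vectors(docs)
    trial_counts = Counter(trial_text.lower().split())
    # Words never seen during enrollment carry no evidence and are ignored.
    trial_vec = {w: c * idf[w] for w, c in trial_counts.items() if w in idf}
    scores = {spk: cosine(trial_vec, vec) for spk, vec in vectors.items()}
    return max(scores, key=scores.get), scores


# Toy, made-up transcripts standing in for two speakers reading different books.
enrollment = {
    "spk_a": "the whale rose beside the ship and the captain raised his harpoon",
    "spk_b": "the detective examined the letter and the footprints near the window",
}
trial = "the captain watched the whale from the ship"

best, scores = identify_speaker(trial, enrollment)
print(best, scores)
```

In this toy case the trial transcript shares book-specific words with only one enrolled speaker, so that speaker receives the higher score even though no acoustic information is used. The same idea applies regardless of how well the voice itself has been anonymized, because the words are left intact.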