Zero-shot text-to-speech (TTS) voice cloning poses severe privacy risks, creating a need to remove specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this setting, because a zero-shot TTS model can dynamically reconstruct a voice from reference prompts alone. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which a trained model is modified to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines with 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility, measured by word error rate (WER), and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and an evaluation framework to support further advances in generative voice privacy.
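The abstract names its metrics (WER, AUC, FSSIM) without defining them, so the following is only a minimal sketch of how such quantities are commonly computed, not the authors' implementation. WER is the standard word-level edit distance divided by the reference length. The other two parts are assumptions: that a similarity score such as FSSIM is a cosine similarity between speaker embeddings, and that AUC is a ROC AUC over two groups of scores. The function names `word_error_rate`, `cosine_similarity`, and `roc_auc` are hypothetical.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors.

    Assumption: speaker similarity is measured this way; the abstract does
    not say how FSSIM is defined.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def roc_auc(positive_scores, negative_scores) -> float:
    """ROC AUC as the probability that a random positive score exceeds a
    random negative score, with ties counted as one half.

    Assumption: which scores count as positive or negative in the paper is
    not stated in the abstract.
    """
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))
```

For instance, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion over six reference words. The embeddings and score groups passed to the other two functions are placeholders that would come from a speaker-verification model in practice.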