Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious misuse risks, as they can synthesize the voices of individuals who never consented. In this context, speaker unlearning aims to prevent the generation of specific speaker identities upon request. Existing approaches rely on retraining, which is costly and limited to speakers seen in the training set. We present TruS, a training-free speaker unlearning framework that shifts the paradigm from data deletion to inference-time control. TruS steers identity-specific hidden activations to suppress target speakers while preserving other attributes (e.g., prosody and emotion). Experimental results show that TruS effectively prevents voice generation for both seen and unseen opt-out speakers, establishing a scalable safeguard for speech synthesis. The demo and code are available at http://mmai.ewha.ac.kr/trus.
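The core idea of steering identity-specific hidden activations at inference time can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the mean-difference construction of the identity direction, and the scaling parameter `alpha` are all illustrative assumptions, shown only to make the general activation-steering mechanism concrete.

```python
import numpy as np

def identity_direction(target_acts: np.ndarray, reference_acts: np.ndarray) -> np.ndarray:
    """Hypothetical identity direction: mean hidden activation of the
    opt-out speaker minus the mean over reference speakers, normalized."""
    v = target_acts.mean(axis=0) - reference_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Suppress the identity direction in a batch of hidden states by
    removing (alpha times) each state's projection onto v. With alpha=1
    the steered states carry no component along v, while components
    orthogonal to v (e.g., prosody-related) are left untouched."""
    proj = hidden @ v  # projection coefficient per hidden state
    return hidden - alpha * np.outer(proj, v)
```

In a real TTS model, `steer` would be applied to intermediate activations (e.g., via a forward hook) whenever generation for an opt-out speaker must be blocked; here it simply demonstrates the projection-removal arithmetic on plain arrays.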