Personalized text-to-speech (TTS) is an exciting and highly desired application that lets users train a TTS voice from only a few recordings. However, TTS training typically requires many hours of recordings and a large model, making it unsuitable for deployment on mobile devices. To overcome this limitation, related work typically fine-tunes a pre-trained TTS model, preserving its ability to generate high-quality audio samples while adapting it to the target speaker's voice. This process is commonly referred to as ``voice cloning.'' Although such work has been highly successful at changing the TTS model's voice, it still fine-tunes from a large pre-trained model, so the voice-cloned model remains large. In this paper, we propose applying trainable structured pruning to voice cloning. By training the structured pruning masks on voice-cloning data, we can produce a unique pruned model for each target speaker. Our experiments demonstrate that, with learnable structured pruning, we can make the model 7 times smaller while achieving comparable voice-cloning performance.
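To make the idea of training structured pruning masks more concrete, here is a minimal toy sketch. It is not the model or training procedure of this paper: the linear "model", the sigmoid gate per channel, the sparsity penalty `lam`, the 0.5 threshold, and the names `train_channel_masks` and `prune` are all assumptions chosen for illustration. The pre-trained weights stay frozen, only the per-channel gates are trained on the (synthetic) adaptation data, and channels whose gate falls below the threshold are removed as whole units.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_channel_masks(weights, data, lam=0.05, lr=1.0, steps=1500):
    """Learn one gate per channel by gradient descent; weights stay frozen.

    Hypothetical toy objective: mean squared error + lam * sum(gates).
    """
    n = len(weights)
    logits = [2.0] * n  # gates start near 1, i.e. keep every channel
    for _ in range(steps):
        gates = [sigmoid(t) for t in logits]
        grad_gate = [0.0] * n
        for x, y in data:
            pred = sum(g * w * xi for g, w, xi in zip(gates, weights, x))
            err = pred - y
            for j in range(n):
                grad_gate[j] += 2.0 * err * weights[j] * x[j] / len(data)
        for j in range(n):
            # Chain rule through the sigmoid, plus the sparsity penalty.
            logits[j] -= lr * (grad_gate[j] + lam) * gates[j] * (1.0 - gates[j])
    return [sigmoid(t) for t in logits]


def prune(weights, gates, threshold=0.5):
    """Keep only channels whose learned gate exceeds the threshold."""
    return {j: w for j, (w, g) in enumerate(zip(weights, gates)) if g > threshold}


# Synthetic stand-in for "pre-trained" weights and target-speaker data:
# the target only depends on channels 0 and 3.
rng = random.Random(0)
weights = [1.0, -0.8, 0.5, 1.2, -0.3, 0.9]
data = []
for _ in range(100):
    x = [rng.uniform(-1.0, 1.0) for _ in weights]
    y = weights[0] * x[0] + weights[3] * x[3]
    data.append((x, y))

gates = train_channel_masks(weights, data)
kept = prune(weights, gates)
print("kept channels:", sorted(kept))
```

Because each gate covers a whole channel rather than an individual weight, the pruned model is genuinely smaller instead of merely sparse, and different adaptation data would generally yield a different set of kept channels.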