We present NanoVoice, a personalized text-to-speech model that efficiently constructs voice adapters for multiple speakers simultaneously. NanoVoice introduces a batch-wise speaker adaptation technique capable of fine-tuning multiple references in parallel, significantly reducing training time. Beyond building separate adapters for each speaker, we also propose a parameter sharing technique that reduces the number of parameters used for speaker adaptation. By incorporating a novel trainable scale matrix, NanoVoice mitigates potential performance degradation during parameter sharing. NanoVoice achieves performance comparable to the baselines, while training 4 times faster and using 45 percent fewer parameters for speaker adaptation with 40 reference voices. Extensive ablation studies and analysis further validate the efficiency of our model.