For personalized speech generation, a neural text-to-speech (TTS) model must be implemented successfully with only limited data from a target speaker. To this end, the baseline TTS model needs to generalize well to out-of-domain data (i.e., the target speaker's speech). However, approaches to this out-of-domain generalization problem in TTS have yet to be studied thoroughly. In this work, we propose sparse attention, an effective pruning method for transformers, to improve the TTS model's generalization ability. In particular, we prune redundant connections from self-attention layers, namely those whose attention weights fall below a threshold. To flexibly determine the pruning strength when searching for the optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to learn the thresholds automatically. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity.
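
The threshold-based pruning described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say whether pruned rows are renormalized or what form the differentiable relaxation takes, so the renormalization step, the sigmoid gate, the `temperature` parameter, and the function names `hard_pruned_attention` and `soft_pruning_mask` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hard_pruned_attention(q, k, v, threshold):
    """Scaled dot-product attention with weights below `threshold` zeroed.

    q, k, v: arrays of shape (sequence_length, dim).
    Renormalizing the surviving weights is an assumption of this sketch.
    """
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    keep = weights >= threshold
    # Always keep each row's largest weight so no query loses every connection.
    keep[np.arange(weights.shape[0]), weights.argmax(axis=-1)] = True
    pruned = np.where(keep, weights, 0.0)
    pruned = pruned / pruned.sum(axis=-1, keepdims=True)
    return pruned @ v, pruned


def soft_pruning_mask(weights, threshold, temperature=0.01):
    """Smooth gate in (0, 1): close to 1 above `threshold`, close to 0 below.

    Hypothetical relaxation: unlike the hard comparison above, this gate has
    a nonzero gradient with respect to `threshold`, so the threshold could in
    principle be trained together with the model.
    """
    return 1.0 / (1.0 + np.exp(-(weights - threshold) / temperature))


# Usage with random inputs: 4 tokens, 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
output, sparse_weights = hard_pruned_attention(q, k, v, threshold=0.2)
```

A higher `threshold` removes more connections, so it acts as the pruning strength. The soft gate only shows why a smooth surrogate makes that strength learnable by gradient descent.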