Flow-based generative models are widely used in text-to-speech (TTS) systems to learn the distribution of audio features (e.g., Mel-spectrograms) conditioned on the input tokens and to sample from this distribution to generate diverse utterances. However, in the zero-shot multi-speaker TTS scenario, the generated utterances lack diversity and naturalness. In this paper, we propose to improve the diversity of utterances by explicitly learning each speaker's distribution of fundamental frequency sequences (pitch contours) during training with a stochastic flow-based pitch predictor, and then conditioning the model on pitch contours generated by this predictor during inference. Experimental results show that a Glow-TTS model with explicit stochastic pitch prediction produces significantly more natural and diverse speech than both a Glow-TTS baseline and an improved Glow-TTS model that uses a stochastic duration predictor.