Several fully end-to-end text-to-speech (TTS) models have been proposed and have outperformed cascaded models, in which the acoustic model and the vocoder are trained separately. However, they often generate unstable pitch contours with audible artifacts when the dataset contains emotional attributes, i.e., a large diversity of pronunciation and prosody. To address this problem, we propose Period VITS, a novel end-to-end TTS model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts frame-level prosodic features, such as pitch and voicing flags, from the input text. From these features, the periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch. Finally, the entire model is jointly optimized in an end-to-end manner with variational inference and adversarial objectives. As a result, the decoder becomes capable of generating more stable, expressive, and natural output waveforms. Experimental results show that the proposed model significantly outperforms baseline models in terms of naturalness, with improved pitch stability in the generated samples.
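To make the idea of a sample-level sinusoidal source concrete, the following is a minimal sketch of how frame-level pitch values and voicing flags could be expanded into a waveform-rate excitation signal. It is not the authors' implementation: the function name `sine_source`, the parameters `hop_length`, `sample_rate`, and `noise_std`, the repetition-based upsampling, and the use of low-level Gaussian noise in unvoiced frames are all assumptions chosen for illustration, loosely in the spirit of source-filter style vocoders.

```python
import math
import random


def sine_source(f0_frames, voiced_flags, hop_length=256,
                sample_rate=24000, noise_std=0.003, seed=0):
    """Expand frame-level F0 (Hz) and voicing flags into a sample-level source.

    Hypothetical helper for illustration only. Each frame is held constant
    for `hop_length` samples (an assumed upsampling scheme). Voiced frames
    produce a unit-amplitude sine whose phase is accumulated across frames,
    so the signal stays continuous when F0 changes. Unvoiced frames produce
    only Gaussian noise (also an assumption).
    """
    if len(f0_frames) != len(voiced_flags):
        raise ValueError("f0_frames and voiced_flags must have equal length")

    rng = random.Random(seed)
    phase = 0.0
    samples = []
    for f0, voiced in zip(f0_frames, voiced_flags):
        for _ in range(hop_length):
            noise = rng.gauss(0.0, noise_std)
            if voiced:
                # Advance the phase by the instantaneous angular frequency.
                phase += 2.0 * math.pi * f0 / sample_rate
                samples.append(math.sin(phase) + noise)
            else:
                # No periodic component when the frame is unvoiced.
                samples.append(noise)
    return samples


# Two voiced frames at 100 Hz followed by one unvoiced frame.
source = sine_source([100.0, 100.0, 0.0], [1, 1, 0],
                     hop_length=240, sample_rate=24000, noise_std=0.0)
print(len(source))
```

In a model of the kind described above, a signal like this would be passed to the waveform decoder as an explicit periodic cue, rather than leaving the decoder to infer periodicity from latent features alone. How the source is actually conditioned and injected into the decoder is not specified in this abstract.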