Text-to-speech (TTS) is a complex task that can be addressed with deep learning models, which in turn require a suitable dataset. Because little work has been done in this area for the Persian language, this paper introduces ArmanTTS, a single-speaker dataset. We compare the characteristics of this dataset with those of several widely used datasets to show that ArmanTTS meets the standards needed for training a Persian text-to-speech model. We also combine Tacotron 2 and HiFi-GAN into a model that takes phonemes as input and produces the corresponding speech as output. In terms of mean opinion score (MOS), real speech obtained 4.0, speech predicted by the vocoder obtained 3.87, and synthetic speech generated by the full TTS model obtained 2.98.
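For readers unfamiliar with the metric, a MOS is the arithmetic mean of listener ratings. The sketch below shows how such a score and a rough confidence interval could be computed, assuming the usual 1-to-5 rating scale; the abstract does not state the scale, the number of listeners, or how intervals were obtained, so the function name `mos_with_ci` and the ratings are hypothetical and are not data from the paper.

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of a normal-approximation confidence interval).

    Assumes ratings on a 1-5 scale; z=1.96 corresponds to roughly 95% coverage.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    # Standard error of the mean, scaled by z.
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings for one system, NOT taken from the paper.
example_ratings = [3, 3, 4, 2, 3, 3, 4, 2, 3, 3]
mos, ci = mos_with_ci(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean helps judge whether a gap such as the one between vocoder output and full TTS output is larger than listener-to-listener variation.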