Speech synthesis technology has advanced significantly in recent years, enabling natural and expressive synthetic speech. Generating synthetic child speech is of particular interest, and it presents unique challenges because of children's distinct vocal characteristics and developmental stages. This paper presents a novel approach that leverages the FastPitch text-to-speech (TTS) model to generate high-quality synthetic child speech. We follow a transfer learning pipeline in which a pretrained multi-speaker TTS model is finetuned on child speech. We use the cleaned version of the publicly available MyST dataset (55 hours) for our finetuning experiments. We also release a prototype dataset of synthetic speech samples generated in this research, together with the model code, to support further research. An objective assessment using a pretrained MOSNet showed a significant correlation between real and synthetic child voices. Additionally, to validate the intelligibility of the generated speech, we used an automatic speech recognition (ASR) model to compare the word error rates (WER) of real and synthetic child voices. We also measured the speaker similarity between real and generated speech using a pretrained speaker encoder.
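To make the intelligibility and speaker-similarity measurements mentioned above concrete, the following is a minimal sketch and not the evaluation code used in this work. It assumes that ASR transcripts and speaker embeddings have already been produced by separate tools; the function names `word_error_rate` and `cosine_similarity` are hypothetical, and cosine similarity is only one common way to compare speaker embeddings, which the abstract does not specify.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical transcripts: the reference text and an ASR output.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical low-dimensional stand-ins for speaker embeddings.
    print(cosine_similarity([0.2, 0.5, 0.1], [0.25, 0.45, 0.05]))
```

In a comparison like the one described, the same ASR model would transcribe both the real and the synthetic utterances, and the two resulting WER values would be reported side by side.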