This paper describes the DeepZen text-to-speech (TTS) system for Blizzard Challenge 2023. The goal of the challenge is to synthesise natural, high-quality speech in French, from a large single-speaker dataset (hub task) and from a smaller dataset through speaker adaptation (spoke task). We participated in both tasks with the same model architecture. Our approach is to use an auto-regressive model, which retains an advantage in generating natural-sounding speech, while improving prosodic control in several ways. Like Non-Attentive Tacotron, the model uses a duration predictor and Gaussian upsampling at inference, but with a simpler unsupervised training scheme. We also model speaking style at both the sentence and word levels by extracting global and local style tokens from the reference speech. At inference, the global and local style tokens are predicted by a BERT model run on the text. This BERT model is also used to predict specific pronunciation features such as schwa elision and optional liaisons. Finally, a modified version of HiFi-GAN, trained on a large public dataset and fine-tuned on the target voices, generates the speech waveform. Our team is identified as O in the Blizzard evaluation, and the MUSHRA test results show that our system ties for second place in both the hub task (median score of 0.75) and the spoke task (median score of 0.68), out of 18 and 14 participants, respectively.
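To make the Gaussian upsampling step concrete, the following is a minimal sketch of the general technique as it is usually formulated for Non-Attentive Tacotron, not the DeepZen implementation, whose details the abstract does not give. Each token gets a Gaussian centred on the middle of its predicted duration, and every output frame is a weighted average of the token embeddings. The function name `gaussian_upsample`, the plain-list data layout, and the choice to evaluate the weights at frame midpoints are assumptions made for illustration.

```python
import math


def gaussian_upsample(embeddings, durations, sigmas):
    """Upsample token-level embeddings to frame level with Gaussian weights.

    embeddings: list of N vectors (lists of floats), one per token
    durations:  list of N positive durations, in frames
    sigmas:     list of N positive standard deviations, one per token
    Returns a list of T frame vectors, where T = round(sum(durations)).
    """
    if not (len(embeddings) == len(durations) == len(sigmas)):
        raise ValueError("embeddings, durations and sigmas must have equal length")

    # Centre of each token: end of the token minus half its duration.
    centers = []
    end = 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    n_frames = int(round(end))
    dim = len(embeddings[0])
    frames = []
    for t in range(n_frames):
        pos = t + 0.5  # assumption: evaluate at the midpoint of frame t
        # Log-density of each token's Gaussian at this position
        # (the constant term is dropped because it cancels on normalisation).
        logs = [
            -0.5 * ((pos - c) / s) ** 2 - math.log(s)
            for c, s in zip(centers, sigmas)
        ]
        # Normalise in a numerically stable way so the weights sum to 1.
        peak = max(logs)
        weights = [math.exp(value - peak) for value in logs]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Frame vector = weighted average of all token embeddings.
        frames.append(
            [sum(w * e[k] for w, e in zip(weights, embeddings)) for k in range(dim)]
        )
    return frames


if __name__ == "__main__":
    # Two tokens with one-hot embeddings, each lasting two frames.
    for frame in gaussian_upsample([[1.0, 0.0], [0.0, 1.0]], [2, 2], [1.0, 1.0]):
        print([round(x, 3) for x in frame])
```

In this toy run the four output frames shift gradually from the first token's embedding to the second's. This smooth transition is the usual motivation for Gaussian upsampling over simply repeating each embedding for its duration.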