This paper presents our French text-to-speech synthesis system for the Blizzard Challenge 2023. The challenge consists of two tasks: generating high-quality speech from female speakers (Hub) and generating speech that closely resembles specific target speakers (Spoke). For the competition data, we first screened out utterances with missing or erroneous text. We normalized all non-phoneme symbols and removed symbols that had no pronunciation or zero duration. We also added word-boundary and start/end symbols to the text, which our previous experience has shown to improve speech quality. For the Spoke task, we performed data augmentation in accordance with the competition rules. We used an open-source G2P model to transcribe the French texts into phonemes. Because this G2P model outputs International Phonetic Alphabet (IPA) symbols, we applied the same transcription process to the provided competition data for standardization. However, since our compiler could not recognize some special symbols from the IPA chart, we converted all phonemes into the phonetic scheme used in the competition data, following the stated rules. Finally, we resampled all competition audio to a uniform 16 kHz sampling rate. Our acoustic model is based on VITS with a HiFi-GAN vocoder. For the Spoke task, we trained a multi-speaker model and injected speaker information into the duration predictor, vocoder, and flow layers of the model. In the evaluation, our system achieved a quality MOS of 3.6 on the Hub task and 3.4 on the Spoke task, placing it at an average level among all participating teams.
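The word-boundary and start/end symbols mentioned above can be sketched as a small preprocessing step. This is a minimal illustration, not the authors' actual implementation; the token names (`<bos>`, `<eos>`, `#`) are assumptions, as the paper does not specify the symbols used.

```python
def add_boundary_symbols(word_phonemes, bos="<bos>", eos="<eos>", wb="#"):
    """Insert start/end and word-boundary symbols into a phoneme sequence.

    word_phonemes: list of per-word phoneme lists,
    e.g. [["b", "o"], ["ʒ", "u", "ʁ"]] for "bonjour" (illustrative).
    """
    seq = [bos]
    for i, word in enumerate(word_phonemes):
        seq.extend(word)
        # Word boundary between words, not after the final word.
        if i < len(word_phonemes) - 1:
            seq.append(wb)
    seq.append(eos)
    return seq

print(add_boundary_symbols([["b", "o"], ["ʒ", "u", "ʁ"]]))
# → ['<bos>', 'b', 'o', '#', 'ʒ', 'u', 'ʁ', '<eos>']
```

Symbols added this way become ordinary tokens in the model's input vocabulary, so no architectural change is needed to benefit from them.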
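The final preprocessing step, resampling all audio to a uniform 16 kHz, could look like the following sketch. The choice of `scipy.signal.resample_poly` is an assumption for illustration; the paper does not state which resampler was used.

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(wav: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (polyphase, with anti-aliasing)."""
    target_sr = 16_000
    if orig_sr == target_sr:
        return wav
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(wav, target_sr // g, orig_sr // g)

# Example: one second of a 440 Hz tone at 22.05 kHz becomes 16000 samples.
sr = 22_050
t = np.arange(sr) / sr
y = resample_to_16k(np.sin(2 * np.pi * 440 * t), sr)
print(len(y))  # 16000
```

Resampling to a single rate keeps the vocoder's input features consistent across all training utterances.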