The FruitShell French synthesis system at the Blizzard 2023 Challenge

This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023. The challenge consists of two tasks: generating high-quality speech from female speakers and generating speech that closely resembles specific individuals. We first screened the competition data to remove missing or erroneous text. We cleaned up all non-phoneme symbols and removed those with no pronunciation or zero duration. We also added word-boundary and start/end symbols to the text, which in our previous experience improves speech quality. For the Spoke task, we performed data augmentation in accordance with the competition rules. We used an open-source G2P model to transcribe the French texts into phonemes. Because the G2P model uses the International Phonetic Alphabet (IPA), we applied the same transcription process to the provided competition data for standardization. However, because the compiler could not recognize some special symbols from the IPA chart, we converted all phonemes by rule into the phonetic scheme used in the competition data. Finally, we resampled all competition audio to a uniform sampling rate of 16 kHz. We employed a VITS-based acoustic model with the HiFi-GAN vocoder. For the Spoke task, we trained a multi-speaker model and incorporated speaker information into the duration predictor, vocoder, and flow layers of the model. In the evaluation, our system obtained a quality MOS of 3.6 for the Hub task and 3.4 for the Spoke task, placing it around the average of all participating teams.
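As an illustration of the text preprocessing described above, the following minimal Python sketch drops symbols with no pronunciation or zero duration and then inserts word-boundary and start/end symbols. It is not the system's actual code: the token names `<bos>`, `<eos>`, and `#`, the function `build_sequence`, and the input layout (words as lists of phoneme/duration pairs) are assumptions made for the example, and the phoneme labels are placeholders rather than the competition phone set.

```python
from typing import List, Tuple

# Hypothetical special symbols; the actual tokens used by the system are not specified.
BOS, EOS, WORD_BOUNDARY = "<bos>", "<eos>", "#"

# One word is a list of (phoneme, duration_in_seconds) pairs.
Word = List[Tuple[str, float]]


def build_sequence(words: List[Word]) -> List[str]:
    """Filter unusable symbols and add start/end and word-boundary markers."""
    seq = [BOS]
    for word in words:
        # Keep only symbols that have a label and a positive duration.
        kept = [phoneme for phoneme, duration in word if phoneme and duration > 0]
        if not kept:
            # The whole word was silent or empty, so it contributes nothing.
            continue
        if len(seq) > 1:
            # Separate this word from the previous one.
            seq.append(WORD_BOUNDARY)
        seq.extend(kept)
    seq.append(EOS)
    return seq


if __name__ == "__main__":
    example = [
        [("b", 0.05), ("o~", 0.10)],
        [("_", 0.0)],  # zero-duration symbol, removed
        [("Z", 0.06), ("u", 0.08), ("R", 0.07)],
    ]
    print(build_sequence(example))
```

Placing the boundary marker only between words that survive filtering avoids emitting consecutive boundary symbols when a word is dropped entirely.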
