This paper presents our French text-to-speech synthesis system for the Blizzard Challenge 2023. The challenge consists of two tasks: generating high-quality speech from female speakers (Hub) and generating speech that closely resembles specific target speakers (Spoke). For the competition data, we first screened out utterances with missing or erroneous text. We normalized all non-phoneme symbols and removed symbols that had no pronunciation or zero duration. We also added word-boundary and start/end symbols to the text, which our previous experience has shown to improve speech quality. For the Spoke task, we performed data augmentation in accordance with the competition rules. We used an open-source G2P model to transcribe the French texts into phonemes. Because this G2P model outputs International Phonetic Alphabet (IPA) symbols, we applied the same transcription process to the provided competition data for standardization. However, since our compiler could not recognize some special symbols from the IPA chart, we converted all phonemes into the phonetic scheme used in the competition data, following the stated rules. Finally, we resampled all competition audio to a uniform 16 kHz sampling rate. Our acoustic model is based on VITS with a HiFi-GAN vocoder. For the Spoke task, we trained a multi-speaker model and injected speaker information into the duration predictor, vocoder, and flow layers of the model. In the evaluation, our system achieved a quality MOS of 3.6 on the Hub task and 3.4 on the Spoke task, placing it at an average level among all participating teams.
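The word-boundary and start/end symbols mentioned above can be sketched as a small preprocessing step. This is a minimal illustration, not the authors' actual implementation; the token names (`<bos>`, `<eos>`, `#`) are assumptions, as the paper does not specify the symbols used.

```python
def add_boundary_symbols(word_phonemes, bos="<bos>", eos="<eos>", wb="#"):
    """Insert start/end and word-boundary symbols into a phoneme sequence.

    word_phonemes: list of per-word phoneme lists,
    e.g. [["b", "o"], ["ʒ", "u", "ʁ"]] for "bonjour" (illustrative).
    """
    seq = [bos]
    for i, word in enumerate(word_phonemes):
        seq.extend(word)
        # Word boundary between words, not after the final word.
        if i < len(word_phonemes) - 1:
            seq.append(wb)
    seq.append(eos)
    return seq

print(add_boundary_symbols([["b", "o"], ["ʒ", "u", "ʁ"]]))
# → ['<bos>', 'b', 'o', '#', 'ʒ', 'u', 'ʁ', '<eos>']
```

Symbols added this way become ordinary tokens in the model's input vocabulary, so no architectural change is needed to benefit from them.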
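The final preprocessing step, resampling all audio to a uniform 16 kHz, could look like the following sketch. The choice of `scipy.signal.resample_poly` is an assumption for illustration; the paper does not state which resampler was used.

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(wav: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (polyphase, with anti-aliasing)."""
    target_sr = 16_000
    if orig_sr == target_sr:
        return wav
    g = np.gcd(orig_sr, target_sr)
    return resample_poly(wav, target_sr // g, orig_sr // g)

# Example: one second of a 440 Hz tone at 22.05 kHz becomes 16000 samples.
sr = 22_050
t = np.arange(sr) / sr
y = resample_to_16k(np.sin(2 * np.pi * 440 * t), sr)
print(len(y))  # 16000
```

Resampling to a single rate keeps the vocoder's input features consistent across all training utterances.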