In this paper, we present MuLanTTS, the Microsoft end-to-end neural text-to-speech (TTS) system designed for the Blizzard Challenge 2023. The challenge released about 50 hours of French audiobook data for the hub task and another 2 hours of data for the speaker-adaptation spoke task, to build synthesized voices for different test purposes including sentences, paragraphs, homographs, lists, etc. Building upon DelightfulTTS, we adopt contextual and emotion encoders to adapt to the audiobook data, modeling context beyond single sentences to enrich long-form prosody and dialogue expressiveness. To address recording quality, we also apply denoising algorithms and long-audio processing to both corpora. For the hub task, only the 50-hour single-speaker data is used to build the TTS system, while for the spoke task, a multi-speaker source model is fine-tuned on the target speaker. MuLanTTS achieves mean quality assessment scores of 4.3 and 4.5 in the respective tasks, statistically comparable with natural speech, while keeping good similarity according to the similarity assessment. The excellent quality and similarity in this year's new and dense statistical evaluation show the effectiveness of our proposed system in both tasks.