Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis

Cross-speaker style transfer in speech synthesis aims at transferring a style from source speaker to synthesized speech of a target speaker's timbre. In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style, similar to the one-to-many problem(i.e., multiple prosody variations correspond to the same text). In response to this problem, a strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre, improving the representation and interpretability of the global style embedding, which can alleviate the one-to-many mapping and data imbalance problems in prosody prediction. A hierarchical prosody predictor is proposed to improve prosody modeling. We find that better style transfer can be achieved by using the source speaker's prosody features that are easily predicted. Additionally, a speaker-transfer-wise cycle consistency loss is proposed to assist the model in learning unseen style-timbre combinations during the training phase. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.

翻译：在语音合成的跨说话人风格迁移中，目标是将源说话人的风格迁移至目标说话人音色合成的语音。以往多数方法中，合成语音的细粒度韵律特征常表现为源说话人的平均风格，类似于“一对多”问题（即同一文本对应多种韵律变体）。针对此问题，本文提出了一种强度可控的半监督风格提取器，用于解耦风格与内容及音色，从而提升全局风格嵌入的表征能力与可解释性，缓解韵律预测中的“一对多”映射与数据不平衡问题。同时，还提出了一种层级韵律预测器以改进韵律建模。研究发现，通过使用源说话人易于预测的韵律特征，可实现更优的风格迁移。此外，本文提出了一种面向说话人迁移的循环一致性损失，辅助模型在训练阶段学习未见的风格-音色组合。实验结果表明，该方法优于基线系统，并提供了包含音频样本的演示网站。

相关内容

语音合成

关注 491

语音合成（Speech Synthesis），也称为文语转换（Text-to-Speech, TTS,它是将任意的输入文本转换成自然流畅的语音输出。语音合成涉及到人工智能、心理学、声学、语言学、数字信号处理、计算机科学等多个学科技术，是信息处理领域中的一项前沿技术。随着计算机技术的不断提高，语音合成技术从早期的共振峰合成,逐步发展为波形拼接合成和统计参数语音合成，再发展到混合语音合成；合成语音的质量、自然度已经得到明显提高，基本能满足一些特定场合的应用需求。目前，语音合成技术在银行、医院等的信息播报系统、汽车导航系统、自动应答呼叫中心等都有广泛应用，取得了巨大的经济效益。另外，随着智能手机、MP3、PDA 等与我们生活密切相关的媒介的大量涌现，语音合成的应用也在逐渐向娱乐、语音教学、康复治疗等领域深入。可以说语音合成正在影响着人们生活的方方面面。

ChatGPT大模型全栈技术讲解！霍普金斯最新《NLP：自监督模型》2023课程全面讲解预训练指令学习和RLHF等技术，附讲义

专知会员服务

108+阅读 · 2023年4月8日

自然语言处理顶会NAACL2022最佳论文出炉！

专知会员服务

43+阅读 · 2022年6月30日

最新《自监督表示学习》报告，70页ppt

专知会员服务

86+阅读 · 2020年12月22日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日