Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to speech synthesized in a target speaker's timbre. In most previous methods, the synthesized fine-grained prosody features tend to represent the source speaker's average style, an issue similar to the one-to-many problem (i.e., multiple prosody variations correspond to the same text). To address this, we propose a strength-controlled semi-supervised style extractor that disentangles style from content and timbre. It improves the representation and interpretability of the global style embedding, which alleviates the one-to-many mapping and data imbalance problems in prosody prediction. We also propose a hierarchical prosody predictor to improve prosody modeling, and we find that better style transfer is achieved by using those prosody features of the source speaker that are easy to predict. Additionally, we propose a speaker-transfer-wise cycle consistency loss that helps the model learn unseen style-timbre combinations during training. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
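The general idea behind a cycle consistency loss for speaker transfer can be sketched as follows: synthesize speech for a style-timbre pairing that has no ground-truth recording, re-extract the style from the result, and penalize any drift from the original style embedding. The text above does not state the actual formulation, so everything in this sketch is an assumption made for illustration: the L1 distance, the toy "synthesizer", and the function names `cycle_consistency_loss`, `toy_synthesize`, `clean_extractor`, and `leaky_extractor` are hypothetical and are not the authors' implementation.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(style, target_speaker, synthesize, extract_style):
    """Hypothetical cycle loss: style -> synthesized speech -> recovered style.

    No reference audio is needed for the (style, target_speaker) pair,
    which is why such a loss can cover unseen style-timbre combinations.
    """
    transferred = synthesize(style, target_speaker)
    recovered = extract_style(transferred)
    return l1_distance(style, recovered)


# --- Toy stand-ins (assumptions, not real models) ---

def toy_synthesize(style, speaker):
    # "Speech" is modeled as a record holding prosody and timbre separately.
    return {"prosody": list(style), "timbre": list(speaker)}


def clean_extractor(speech):
    # Perfectly disentangled: reads prosody only.
    return list(speech["prosody"])


def leaky_extractor(speech):
    # Poorly disentangled: timbre leaks into the style embedding.
    return [p + 0.5 * t for p, t in zip(speech["prosody"], speech["timbre"])]


style = [0.25, -1.0]          # style embedding from a source speaker
target_speaker = [1.0, 2.0]   # timbre embedding of a different speaker

clean_loss = cycle_consistency_loss(style, target_speaker, toy_synthesize, clean_extractor)
leaky_loss = cycle_consistency_loss(style, target_speaker, toy_synthesize, leaky_extractor)
print(clean_loss)  # 0.0
print(leaky_loss)  # 0.75
```

In this toy setting the loss is zero when the extractor ignores timbre and grows when timbre leaks into the recovered style, which is the behavior such a loss is meant to encourage. In a real system the two callables would be trainable networks and the loss would be minimized by gradient descent alongside the other training objectives.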