To improve the speaking style of synthesized speech, current text-to-speech (TTS) synthesis systems commonly use reference speech to stylize their outputs rather than relying on the input text alone. The reference speech is either selected manually, which is resource-consuming, or selected automatically using semantic features. However, semantic features contain not only style-related information but also style-irrelevant information. The style-irrelevant information in the text can interfere with reference audio selection and result in improper speaking styles. To improve reference selection, we propose the Contrastive Acoustic-Linguistic Module (CALM), which extracts a Style-related Text Feature (STF) from the text. CALM uses contrastive learning to optimize the correlation between the speaking style embedding and the extracted STF. For a given input text, a certain number of the most appropriate reference speech samples are then selected by retrieving those with the highest STF similarities. The style embeddings of the selected references are combined in a weighted sum according to their STF similarities, and the result is used to stylize the speech synthesized by the TTS system. Experimental results demonstrate the effectiveness of the proposed approach: in both objective and subjective evaluations of the speaking style of the synthesized speech, it outperforms a baseline approach that selects references based on semantic features.
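The retrieval and weighting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure or the weighting scheme, so cosine similarity and softmax-normalized weights are assumptions here, and the function names, the value of `k`, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_style(query_stf, candidate_stfs, candidate_styles, k=2):
    """Pick the k references whose STF is most similar to the query STF,
    then blend their style embeddings with similarity-based weights.

    Assumption: weights are a softmax over the top-k similarities.
    """
    sims = [cosine_similarity(query_stf, stf) for stf in candidate_stfs]

    # Indices of the k candidates with the highest STF similarity.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

    # Softmax over the selected similarities (shifted by the max for stability).
    max_sim = max(sims[i] for i in top)
    exps = [math.exp(sims[i] - max_sim) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the selected style embeddings, one dimension at a time.
    dim = len(candidate_styles[0])
    blended = [
        sum(w * candidate_styles[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return top, weights, blended


# Toy data: three reference utterances with 2-d STFs and 3-d style embeddings.
candidate_stfs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidate_styles = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

top, weights, blended = select_style(
    [1.0, 0.0], candidate_stfs, candidate_styles, k=2
)
print(top, weights, blended)
```

In this toy case the first and third references are retrieved, and the first receives the larger weight because its STF matches the query exactly. In a real system the STFs would come from the trained text-side encoder and the style embeddings from the acoustic side; here they are plain lists for illustration only.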