Although high-fidelity speech can be obtained in intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain the speaker's timbre (i.e., speaker similarity) and eliminate the accent carried over from the speaker's first language (i.e., nativeness). In this paper, we demonstrate that vector-quantized (VQ) acoustic features contain less speaker information than mel-spectrograms. Based on this finding, we propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with an authentic speaking style. Here, one embedding is fed to the acoustic model to learn the linguistic speaking style, while the other is integrated into the vocoder to mimic the target speaker's timbre. Experiments show that by combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis, especially in terms of nativeness.