Accent plays a significant role in speech communication: it affects how easily a listener understands speech and conveys the speaker's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder (CVAE). The framework can synthesize speech in a selected speaker's voice and convert it to any desired target accent. We validate its effectiveness through thorough experiments with both objective and subjective evaluations. The results also show that the model manipulates accent in the synthesized speech remarkably well. Overall, the proposed framework presents a promising avenue for future accented TTS research.