The front-end is a critical component of English text-to-speech (TTS) systems: it extracts the linguistic features, such as prosody and phonemes, that a TTS model needs to synthesize speech. An English TTS front-end typically consists of a text normalization (TN) module, a prosodic word and prosodic phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However, existing research on the English TTS front-end treats each module in isolation, neglecting the interdependence between them and leaving each module with sub-optimal performance. This paper therefore proposes a unified front-end framework that captures the dependencies among the English TTS front-end modules. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance in all modules.