The front-end is a critical component of English text-to-speech (TTS) systems: it extracts the linguistic features, such as prosody and phonemes, that a TTS model needs to synthesize speech. The English TTS front-end typically consists of a text normalization (TN) module, a prosody word and prosody phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However, current research on the English TTS front-end treats each module in isolation, neglecting the interdependence among them, which leads to sub-optimal performance for each module. This paper therefore proposes a unified front-end framework that captures the dependencies among the English TTS front-end modules. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance in all modules.