The front-end is a critical component of English text-to-speech (TTS) systems: it extracts the linguistic features, such as prosody and phonemes, that a TTS model needs to synthesize speech. An English TTS front-end typically consists of a text normalization (TN) module, a prosody word and prosody phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However, existing research on the English TTS front-end treats each module in isolation, neglecting the interdependence between them and yielding sub-optimal performance for each module. This paper therefore proposes a unified front-end framework that captures the dependencies among the English TTS front-end modules. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance in all modules.