An end-to-end (e2e) text-to-speech (TTS) system is a deep architecture that learns to associate a text string with acoustic speech patterns from a curated dataset. All aspects of speech production, such as phone duration, speaker characteristics, and intonation, are expected to be captured in the trained TTS model so that the synthesized speech is natural and intelligible. Human speech is complex, involving smooth transitions between articulatory configurations (ACs). Due to anatomical constraints, some ACs are challenging to mimic or to transition between. In this paper, we experimentally study whether the constraints imposed by human anatomy have an implication for training e2e-TTS systems. We experiment with two e2e-TTS architectures: Tacotron-2, an autoregressive model, and VITS-TTS, a non-autoregressive model. In this study, we build TTS systems using (a) forward text and forward speech (conventional e2e-TTS), (b) reverse text and reverse speech (r-e2e-TTS), and (c) reverse text and forward speech (rtfs-e2e-TTS). Experiments demonstrate that e2e-TTS systems are purely data-driven. Interestingly, the speech generated by the r-e2e-TTS systems exhibits better fidelity, better perceptual intelligibility, and better naturalness.
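The three training configurations can be sketched as simple transformations of a (text, speech) pair. A minimal sketch follows, assuming character-level text reversal and sample-level waveform reversal; the function name `make_training_pair` and the mode labels are illustrative, not from the paper, and an actual pipeline might instead reverse at the phone or frame level.

```python
import numpy as np

def make_training_pair(text: str, wav: np.ndarray, mode: str = "e2e"):
    """Build a (text, speech) pair for one of the three configurations.

    mode: "e2e"  -> forward text, forward speech (conventional e2e-TTS)
          "r"    -> reverse text, reverse speech (r-e2e-TTS)
          "rtfs" -> reverse text, forward speech (rtfs-e2e-TTS)
    """
    if mode == "e2e":
        return text, wav
    if mode == "r":
        # Reverse the character sequence and the time axis of the waveform.
        return text[::-1], wav[::-1]
    if mode == "rtfs":
        # Reverse only the text; the speech keeps its natural direction.
        return text[::-1], wav
    raise ValueError(f"unknown mode: {mode}")
```

In the "r" configuration the reversed text still aligns monotonically with the reversed speech, whereas in "rtfs" the text order runs opposite to the acoustic order, which is what probes whether the model relies on anything beyond the data pairing itself.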