Recent advances in text-to-speech (TTS) have made it possible to generate natural-sounding audio from text. However, audiobook narration involves dramatic vocalizations and intonations by the reader, with greater reliance on the emotions, dialogues, and descriptions in the narrative. Using our dataset of 93 aligned book-audiobook pairs, we present improved models for predicting prosody attributes (pitch, volume, and rate of speech) from narrative text using language modeling. Our predicted prosody attributes correlate much better with human audiobook readings than those produced by a state-of-the-art commercial TTS system: our predicted pitch shows a higher correlation with human reading for 22 out of the 24 books, while our predicted volume proves more similar to human reading for 23 out of the 24 books. Finally, we present a human evaluation study to quantify the extent to which people prefer prosody-enhanced audiobook readings over commercial text-to-speech systems.
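The per-book comparison described above (counting the books for which one system's prosody tracks the human reading more closely than another's) can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure, which the abstract does not specify; the function names `pearson` and `count_wins` and the toy values are hypothetical and are not taken from the paper's data or code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def count_wins(books):
    """Count books where the model correlates better with the human reading.

    `books` maps a book name to a (human, model, baseline) triple of
    per-segment values for one prosody attribute, e.g. pitch.
    """
    wins = 0
    for human, model, baseline in books.values():
        if pearson(human, model) > pearson(human, baseline):
            wins += 1
    return wins


# Toy, made-up per-segment pitch values (assumption: not real measurements).
toy_books = {
    "book_a": ([1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 2.9, 4.2], [2.0, 2.0, 3.0, 2.0]),
    "book_b": ([3.0, 1.0, 2.0], [1.0, 2.0, 3.0], [3.0, 1.0, 2.0]),
}
print(count_wins(toy_books), "of", len(toy_books), "books favor the model")
```

Applied to each attribute separately, a tally of this kind yields figures in the form reported above, such as "22 out of the 24 books" for pitch.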