Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore various speakers' different styles of inserting silent pauses, which can degrade the performance of the model trained on a multi-speaker speech corpus. To this end, we propose more powerful pause insertion frameworks based on a pre-trained language model. Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus, injecting speaker embedding to capture various speaker characteristics. We also leverage duration-aware pause insertion for more natural multi-speaker TTS. We develop and evaluate two types of models. The first improves conventional phrasing models on the position prediction of respiratory pauses (RPs), i.e., silent pauses at word transitions without punctuation. It performs speaker-conditioned RP prediction considering contextual information and is used to demonstrate the effect of speaker information on the prediction. The second model is further designed for phoneme-based TTS models and performs duration-aware pause insertion, predicting both RPs and punctuation-indicated pauses (PIPs) that are categorized by duration. The evaluation results show that our models improve the precision and recall of pause insertion and the rhythm of synthetic speech.
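To make the terminology above concrete, the following is a minimal sketch, not the authors' implementation. It labels each word transition as an RP (no punctuation) or a PIP (punctuation present), buckets the pause by duration, and computes precision and recall for one label. The thresholds `SHORT_MAX` and `MEDIUM_MAX`, the three bucket names, the `Transition` type, and the choice to bucket RPs as well as PIPs are all assumptions made for illustration; the abstract does not specify the actual duration categories.

```python
from dataclasses import dataclass

# Hypothetical duration boundaries in seconds (assumed, not from the paper).
SHORT_MAX = 0.3
MEDIUM_MAX = 0.7

# Characters treated as punctuation at the end of a word (assumed set).
PUNCTUATION = (",", ".", "?", "!", ";", ":")


@dataclass
class Transition:
    word: str         # word preceding the transition, punctuation attached
    pause_sec: float  # measured silence after the word; 0.0 means no pause


def label(t: Transition) -> str:
    """Return a pause label such as 'RP-medium', 'PIP-long', or 'none'."""
    if t.pause_sec <= 0.0:
        return "none"
    # A pause after punctuation is a PIP; a pause without punctuation is an RP.
    kind = "PIP" if t.word.endswith(PUNCTUATION) else "RP"
    if t.pause_sec < SHORT_MAX:
        bucket = "short"
    elif t.pause_sec < MEDIUM_MAX:
        bucket = "medium"
    else:
        bucket = "long"
    return f"{kind}-{bucket}"


def precision_recall(reference, predicted, target):
    """Precision and recall of one label over aligned label sequences."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == target and p == target)
    pred_pos = sum(1 for p in predicted if p == target)
    ref_pos = sum(1 for r in reference if r == target)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / ref_pos if ref_pos else 0.0
    return precision, recall


# Toy utterance: "well, I think so." with made-up pause durations.
transitions = [
    Transition("well,", 0.25),
    Transition("I", 0.0),
    Transition("think", 0.4),
    Transition("so.", 0.9),
]
reference = [label(t) for t in transitions]
# A made-up model output that inserts one spurious RP after "I".
predicted = ["PIP-short", "RP-medium", "RP-medium", "PIP-long"]
rp_precision, rp_recall = precision_recall(reference, predicted, "RP-medium")
```

In this toy case the spurious pause lowers precision for `RP-medium` while recall stays at 1.0, which is the kind of trade-off that reporting both precision and recall for pause insertion exposes.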
