We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis with filled pause (FP) insertion. Spontaneous speech synthesis aims to produce speech with human-like disfluencies, such as FPs. Because the complex data distribution of spontaneous speech with a rich FP vocabulary is difficult to model, the quality of FP-inserted synthetic speech is often limited. To address this issue, the proposed method uses regularization to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, the method uses pseudo-FPs sampled from an FP word prediction model in addition to ground-truth FPs. Our experiments demonstrate that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.
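As a rough illustration of the pseudo-FP idea, the sketch below samples an FP word at each word position from a per-position probability table and inserts it into the word sequence, while recording which tokens are FPs. It is a minimal sketch under stated assumptions, not the method's actual implementation: the function name `insert_pseudo_fps`, the `NO_FP` label, the probability tables (which stand in for the output of an FP word prediction model), and the example FP words are all hypothetical.

```python
import random

# Hypothetical label meaning "insert no FP at this position".
NO_FP = ""


def insert_pseudo_fps(words, fp_probs, rng):
    """Insert sampled pseudo-FPs in front of words.

    words:    list of non-FP (linguistic) words.
    fp_probs: one dict per word, mapping an FP word (or NO_FP) to its
              probability. This is an assumed stand-in for the output
              of an FP word prediction model.
    rng:      a random.Random instance, so sampling is reproducible.

    Returns (tokens, is_fp): the FP-inserted token list and a parallel
    boolean mask that is True for FP tokens and False otherwise.
    """
    if len(words) != len(fp_probs):
        raise ValueError("need one probability table per word")

    tokens, is_fp = [], []
    for word, probs in zip(words, fp_probs):
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        # Sample one candidate according to the predicted probabilities.
        sampled = rng.choices(candidates, weights=weights, k=1)[0]
        if sampled != NO_FP:
            tokens.append(sampled)
            is_fp.append(True)
        tokens.append(word)
        is_fp.append(False)
    return tokens, is_fp


if __name__ == "__main__":
    rng = random.Random(0)
    words = ["i", "think", "so"]
    # Illustrative probabilities only; not taken from any real model.
    fp_probs = [
        {"um": 0.3, "uh": 0.2, NO_FP: 0.5},
        {NO_FP: 1.0},
        {"uh": 0.4, NO_FP: 0.6},
    ]
    tokens, is_fp = insert_pseudo_fps(words, fp_probs, rng)
    print(tokens)
    print(is_fp)
```

The `is_fp` mask is one plausible way to separate the non-FP (linguistic) tokens, to which a regularization term would apply, from the inserted FPs. How the regularization itself is computed is not specified in the text above, so it is left out of this sketch.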