We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis with filled pause (FP) insertion. Spontaneous speech synthesis aims to produce speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, the proposed method applies regularization to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements, which improves robustness to diverse FP insertions. To further improve this robustness, the method uses pseudo-FPs sampled from an FP word prediction model in addition to ground-truth FPs. Our experiments demonstrate that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.