Expressive speech synthesis requires vibrant prosody and well-timed pauses. We propose an effective strategy for augmenting a small dataset to train an expressive end-to-end Text-to-Speech model. We merge audio clips of emotionally congruent sentences, identified with a text emotion recognizer, to create augmented expressive speech data. By training on the resulting two-sentence audio, our model learns natural breaks between lines. We further apply self-supervised contrastive training to improve speaking-style embedding extraction from speech. During inference, our model produces multi-sentence speech in one step, guided by the text-predicted speaking style. Evaluations demonstrate the effectiveness of our approach compared to a baseline model trained on consecutive two-sentence audio. Our synthesized speech exhibits an inter-sentence pause distribution closer to that of the ground-truth speech, and subjective evaluations show it scores higher in naturalness and style suitability than the baseline.
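The augmentation step described above can be sketched in a minimal form. This is an illustrative assumption of how the pairing might work, not the authors' implementation: emotion labels stand in for the output of a real text emotion recognizer, and the waveforms here are toy arrays standing in for recorded speech.

```python
import numpy as np

def pair_congruent(utterances, gap_samples=1600):
    """Pair same-emotion utterances into two-sentence training examples.

    utterances: list of (text, waveform, emotion) triples, where the
    emotion label would come from a text emotion recognizer in practice.
    A short silence is inserted between clips so the model can learn a
    natural inter-sentence pause (the gap length is an assumption here:
    0.1 s at 16 kHz).
    """
    by_emotion = {}
    for text, wav, emotion in utterances:
        by_emotion.setdefault(emotion, []).append((text, wav))
    pairs = []
    for emotion, clips in by_emotion.items():
        # Pair consecutive same-emotion clips; an odd leftover is dropped.
        for (t1, w1), (t2, w2) in zip(clips[::2], clips[1::2]):
            gap = np.zeros(gap_samples, dtype=np.float32)
            merged = np.concatenate([w1, gap, w2])
            pairs.append((t1 + " " + t2, merged, emotion))
    return pairs

# Toy data: 0.5 s of noise per clip standing in for real 16 kHz speech.
rng = np.random.default_rng(0)
data = [
    ("I can't believe it!", rng.standard_normal(8000).astype(np.float32), "surprise"),
    ("This is wonderful.", rng.standard_normal(8000).astype(np.float32), "joy"),
    ("What a shock.", rng.standard_normal(8000).astype(np.float32), "surprise"),
    ("I'm so happy today.", rng.standard_normal(8000).astype(np.float32), "joy"),
]
augmented = pair_congruent(data)
print(len(augmented))  # one merged clip per emotion group
```

Each merged example keeps a single emotion label, so the augmented set stays emotionally congruent while roughly doubling the sentence count per training utterance.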