Recent advances in text-to-speech (TTS) systems, such as FastSpeech and StyleSpeech, have significantly improved the quality of generated speech. However, these models often rely on duration labels produced by external tools such as the Montreal Forced Aligner, which can be time-consuming and inflexible. The importance of accurate durations is often underestimated, despite their crucial role in achieving natural prosody and intelligibility. To address these limitations, we propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. This approach reduces dependence on external tools and improves alignment accuracy. We further explore the impact of different acoustic features, including Mel-spectrograms, MFCCs, and latent features, on TTS model performance. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and tone alignment. These findings highlight the effectiveness of our approach in optimizing TTS systems for more natural and intelligible speech generation.
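
To make the notion of duration labelling concrete, the following is a minimal sketch of how per-phoneme durations might be derived from an aligner's output. It assumes, purely for illustration, that the aligner emits one phoneme label per acoustic frame; the function name `frames_to_durations` and the example frame sequence are hypothetical and are not taken from the proposed system.

```python
from itertools import groupby


def frames_to_durations(frame_labels):
    """Collapse a per-frame phoneme sequence into (phoneme, n_frames) pairs.

    Assumption: the aligner outputs exactly one label per frame.
    Consecutive identical labels are merged into a single run, so two
    adjacent instances of the same phoneme cannot be told apart here.
    """
    return [(phoneme, sum(1 for _ in run)) for phoneme, run in groupby(frame_labels)]


# Hypothetical frame-level alignment for a short utterance.
frames = ["sil", "sil", "n", "i", "i", "i", "h", "ao", "ao", "sil"]
durations = frames_to_durations(frames)

for phoneme, n_frames in durations:
    print(phoneme, n_frames)
```

The resulting frame counts are the kind of duration targets that a duration predictor in a FastSpeech-style model is trained on; the total of the durations always equals the number of input frames.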