Although FastSpeech2 conditions synthesis on aspects of speech such as pitch, energy, and duration, it still leaves scope for richer representations. In this work, we leverage representations from various Self-Supervised Learning (SSL) models to enhance the quality of the synthesized speech. In particular, we pass the FastSpeech2 encoder's length-regulated outputs through a second series of encoder layers trained to reconstruct the SSL representations. In the SALTTS-parallel implementation, the representations from this second encoder are used only for an auxiliary reconstruction loss against the SSL features. The SALTTS-cascade implementation, in contrast, also passes these representations through the decoder, in addition to applying the reconstruction loss. The richness of speech characteristics in the SSL features is reflected in the output speech quality: the proposed approach outperforms the baseline FastSpeech2 on both objective and subjective evaluation measures.
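The difference between the two variants comes down to what the decoder receives. The following is a minimal, framework-free sketch of that data flow and not the authors' implementation. The names `saltts_forward`, `l1_loss`, `total_loss`, and `aux_weight` are hypothetical, the choice of an L1 reconstruction loss and a weighted sum of losses is an assumption, and the second encoder's output is assumed to already match the SSL feature dimension (a real model would likely need a projection layer).

```python
from typing import Callable, List, Tuple

# One feature vector per frame; a stand-in for a (frames x dims) tensor.
Frames = List[List[float]]
Layer = Callable[[Frames], Frames]


def l1_loss(pred: Frames, target: Frames) -> float:
    """Mean absolute error over all frames and dimensions.

    The abstract only says "reconstruction loss"; L1 is an assumption.
    """
    total = sum(
        abs(p - t)
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    )
    count = sum(len(frame) for frame in pred)
    return total / count


def saltts_forward(
    length_regulated: Frames,  # FastSpeech2 encoder output after length regulation
    ssl_target: Frames,        # frame-level features from an SSL model
    second_encoder: Layer,     # the additional encoder layers
    decoder: Layer,            # the FastSpeech2 decoder
    variant: str,              # "parallel" or "cascade"
) -> Tuple[Frames, float]:
    """Return the decoder output and the auxiliary SSL reconstruction loss."""
    # Both variants run the second encoder and compare it to the SSL features.
    ssl_pred = second_encoder(length_regulated)
    aux_loss = l1_loss(ssl_pred, ssl_target)

    if variant == "parallel":
        # The second encoder is a side branch: the decoder still sees the
        # length-regulated FastSpeech2 encoder output.
        decoder_out = decoder(length_regulated)
    elif variant == "cascade":
        # The second encoder sits in the main path: its representations
        # are what the decoder consumes.
        decoder_out = decoder(ssl_pred)
    else:
        raise ValueError(f"unknown variant: {variant!r}")
    return decoder_out, aux_loss


def total_loss(tts_loss: float, aux_loss: float, aux_weight: float = 1.0) -> float:
    """Combine the usual TTS loss with the auxiliary term (weighting is assumed)."""
    return tts_loss + aux_weight * aux_loss
```

In this sketch the auxiliary loss is identical in both variants; only the decoder input changes, which is why the cascade variant lets the SSL-like representations shape the synthesized output directly, while the parallel variant influences it only through the shared layers that feed the second encoder.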