Acoustic features play an important role in improving the quality of synthesised speech. The Mel spectrogram is currently the acoustic feature most widely employed in acoustic models. However, because of the fine-grained detail lost during its Fourier transform process, the clarity of speech synthesised from Mel spectrograms degrades on signals with abrupt changes. To obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). The paradigm introduces an additional task: predicting a more detailed wavelet spectrogram, which, like a post-processing network, takes the Mel spectrogram output by the decoder as input. We choose Tacotron2 and FastSpeech2 for experimental validation in order to cover autoregressive (AR) and non-autoregressive (NAR) speech synthesis systems, respectively. The experimental results demonstrate that speech synthesised by models equipped with the Mel spectrogram enhancement paradigm achieves a higher mean opinion score (MOS), improving on the respective baselines by 0.14 and 0.09. These findings provide some evidence for the universality of the enhancement paradigm, demonstrating its success across different architectures.
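To illustrate the kind of auxiliary target the paradigm predicts, the following is a minimal sketch of a continuous wavelet transform applied to one mel bin's temporal trajectory, implemented with plain NumPy convolution. The Mexican-hat wavelet, the scale range, and the 80×200 random stand-in spectrogram are all illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def mexican_hat(points, scale):
    """Ricker (Mexican-hat) wavelet sampled at `points` positions for `scale`."""
    t = np.arange(points) - (points - 1) / 2.0
    a = 2.0 / (np.sqrt(3.0 * scale) * np.pi ** 0.25)
    return a * (1.0 - (t / scale) ** 2) * np.exp(-(t ** 2) / (2.0 * scale ** 2))

def cwt_1d(signal, scales):
    """CWT of a 1-D signal by direct convolution with scaled wavelets."""
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        # Wavelet support grows with scale; 10*s samples covers the bump.
        w = mexican_hat(min(10 * s, len(signal)), s)
        out[i] = np.convolve(signal, w, mode="same")
    return out

# Hypothetical mel spectrogram: 80 mel bins x 200 frames (random stand-in data).
mel = np.random.rand(80, 200)
scales = np.arange(1, 11)
# Wavelet spectrogram of one mel bin's trajectory: (10 scales, 200 frames).
wav_spec = cwt_1d(mel[0], scales)
print(wav_spec.shape)
```

Each scale picks out temporal variation at a different granularity, which is what makes the wavelet spectrogram a finer-grained training target than the Mel spectrogram alone.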