Speech super-resolution (SSR) aims to recover high-resolution (HR) speech from its low-resolution (LR) counterpart. Recent SSR methods focus mainly on reconstructing the magnitude spectrogram and neglect phase reconstruction, which limits recovery quality. To address this issue, we propose mdctGAN, a novel SSR framework based on the modified discrete cosine transform (MDCT). Through adversarial learning in the MDCT domain, our method reconstructs HR speech in a phase-aware manner without vocoders or additional post-processing. Furthermore, by learning frequency-consistent features with a self-attentive mechanism, mdctGAN guarantees high-quality speech reconstruction. On the VCTK corpus, experimental results show that our model produces natural auditory quality with high MOS and PESQ scores. It also achieves state-of-the-art log-spectral-distance (LSD) performance at a 48 kHz target resolution from various input rates. Code is available at https://github.com/neoncloud/mdctGAN
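To make the phase-aware claim concrete, the following is a minimal NumPy sketch of a windowed MDCT and its inverse. It is not taken from the mdctGAN repository: the function names (`mdct_basis`, `sine_window`, `mdct`, `imdct`), the sine window, and the frame size are all assumptions chosen for illustration. The sketch shows that MDCT coefficients are real-valued and that overlap-add inversion recovers the waveform, so a model that predicts these coefficients needs no separate phase estimate or vocoder.

```python
import numpy as np

def mdct_basis(N):
    # Cosine basis of shape (N, 2N): N coefficients per 2N-sample frame.
    n = np.arange(2 * N)[None, :]
    k = np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

def sine_window(N):
    # Sine window (an assumption); it satisfies the Princen-Bradley
    # condition w[n]^2 + w[n + N]^2 = 1 needed for perfect reconstruction.
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))

def mdct(x, N):
    # Pad with N zeros on each side (plus alignment padding) so that every
    # input sample is covered by exactly two overlapping frames.
    pad = (-len(x)) % N
    xp = np.concatenate([np.zeros(N), x, np.zeros(pad + N)])
    n_frames = len(xp) // N - 1
    frames = np.stack([xp[i * N : i * N + 2 * N] for i in range(n_frames)])
    return (frames * sine_window(N)) @ mdct_basis(N).T  # (n_frames, N), real

def imdct(X, N, length):
    # Inverse transform, window again, then overlap-add with hop size N.
    # Time-domain aliasing cancels between neighbouring frames.
    frames = (2.0 / N) * (X @ mdct_basis(N)) * sine_window(N)
    out = np.zeros((X.shape[0] + 1) * N)
    for i, frame in enumerate(frames):
        out[i * N : i * N + 2 * N] += frame
    return out[N : N + length]  # drop the leading padding

# Usage: a random signal stands in for a speech waveform.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
X = mdct(x, 64)
x_rec = imdct(X, 64, len(x))
```

Because the transform is critically sampled and real-valued, a hop of `N` samples yields exactly `N` coefficients, and the sign of each coefficient carries the information that a magnitude spectrogram discards.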