Speech super-resolution (SSR) aims to recover high-resolution (HR) speech from its low-resolution (LR) counterpart. Recent SSR methods focus mainly on reconstructing the magnitude spectrogram and neglect phase reconstruction, which limits recovery quality. To address this issue, we propose mdctGAN, a novel SSR framework based on the modified discrete cosine transform (MDCT). Through adversarial learning in the MDCT domain, our method reconstructs HR speech in a phase-aware manner without vocoders or additional post-processing. Furthermore, by learning frequency-consistent features with a self-attentive mechanism, mdctGAN ensures high-quality speech reconstruction. On the VCTK corpus, experimental results show that our model produces natural auditory quality with high MOS and PESQ scores. It also achieves state-of-the-art log-spectral distance (LSD) performance at a 48 kHz target resolution from various input rates. Code is available at https://github.com/neoncloud/mdctGAN
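
To illustrate why working in the MDCT domain avoids a separate phase-estimation step, the following is a minimal sketch of a windowed MDCT and its inverse. It is not taken from the mdctGAN repository: the function names (`sine_window`, `mdct_basis`, `mdct`, `imdct`), the sine window, the frame size, and the zero-padding scheme are all assumptions chosen for illustration, and the direct summation is written for clarity rather than speed. The point it shows is that MDCT coefficients are real-valued and that overlap-add of the inverse transform recovers the waveform exactly, so a model that predicts MDCT coefficients implicitly determines both magnitude and phase.

```python
import math


def sine_window(N):
    # Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley).
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct_basis(N):
    # N x 2N cosine basis: C[k][n] = cos(pi/N * (n + 1/2 + N/2) * (k + 1/2)).
    return [[math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
             for n in range(2 * N)] for k in range(N)]


def mdct(x, N):
    # Hypothetical helper: frames of 2N samples, hop N, N real coefficients per frame.
    w, C = sine_window(N), mdct_basis(N)
    pad = (-len(x)) % N
    # Assumed padding: N zeros on each side so every sample lies in two frames.
    xp = [0.0] * N + list(x) + [0.0] * (pad + N)
    n_frames = len(xp) // N - 1
    coeffs = []
    for i in range(n_frames):
        frame = [xp[i * N + n] * w[n] for n in range(2 * N)]
        coeffs.append([sum(C[k][n] * frame[n] for n in range(2 * N))
                       for k in range(N)])
    return coeffs


def imdct(X, N, length):
    # Inverse transform with synthesis window and overlap-add.
    # The 2/N scaling matches the unscaled forward transform above.
    w, C = sine_window(N), mdct_basis(N)
    out = [0.0] * ((len(X) + 1) * N)
    for i, Xi in enumerate(X):
        for n in range(2 * N):
            y = (2.0 / N) * sum(Xi[k] * C[k][n] for k in range(N))
            out[i * N + n] += y * w[n]
    # Drop the leading padding and trim to the original length.
    return out[N:N + length]


# Toy signal (an assumption for the demo, not speech data).
signal = [math.sin(0.1 * n) + 0.5 * math.cos(0.37 * n) for n in range(100)]
coeffs = mdct(signal, 8)
recon = imdct(coeffs, 8, len(signal))
max_err = max(abs(a - b) for a, b in zip(signal, recon))
```

In this sketch each frame of 16 samples yields 8 real coefficients, and `max_err` measures how closely the overlap-added inverse matches the input. A practical system would likely use an FFT-based or library implementation instead of the direct sums.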
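
For readers unfamiliar with the log-spectral distance (LSD) metric, the sketch below computes it under the definition commonly used in the speech super-resolution literature: the root-mean-square difference of log10 power spectra over frequency bins, averaged over frames. The abstract does not spell out the formula, so this definition, the function name `log_spectral_distance`, the `eps` stabilizer, and the toy spectrogram values are all assumptions for illustration; the paper's exact STFT settings may differ.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    # ref_power, est_power: lists of frames, each a list of power-spectrum
    # values |S(t, k)|^2 with the same shape. eps (assumed) avoids log(0).
    per_frame = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        sq_diffs = [
            (math.log10(r + eps) - math.log10(e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # RMS over frequency bins for this frame.
        per_frame.append(math.sqrt(sum(sq_diffs) / len(sq_diffs)))
    # Mean over frames.
    return sum(per_frame) / len(per_frame)


# Toy power spectrograms (made-up values, 2 frames x 3 bins).
reference = [[1.0, 2.0, 4.0], [0.5, 1.5, 3.0]]
scaled = [[10.0 * v for v in frame] for frame in reference]

lsd_same = log_spectral_distance(reference, reference)
lsd_scaled = log_spectral_distance(reference, scaled)
```

Lower values indicate a closer spectral match. Here `lsd_same` is zero because the inputs are identical, and `lsd_scaled` is approximately 1 because every bin differs by one decade in power.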