Generative Adversarial Network (GAN) based vocoders excel in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to enhance GAN-based vocoders. Most existing discriminators based on time-frequency representations are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution is fixed across the spectrogram, making it ill-suited to signals such as singing voices that call for flexible treatment of different frequency bands. Motivated by this, our study uses the Constant-Q Transform (CQT), whose resolution varies across frequencies, which improves the modeling of pitch accuracy and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments conducted on both speech and singing voices confirm the effectiveness of the proposed method. Moreover, we verify that the CQT-based and STFT-based discriminators can be complementary under joint training. In particular, when enhanced by the proposed MS-SB-CQT and the existing MS-STFT Discriminators, HiFi-GAN's mean opinion score (MOS) rises from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
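The frequency-dependent resolution of the CQT, and the idea of grouping its bins by octave, can be illustrated with a minimal sketch. It uses the standard CQT definitions (geometrically spaced center frequencies and a constant ratio Q of center frequency to bandwidth). The function names `cqt_bin_layout` and `split_into_octaves`, and all parameter values (minimum frequency, bins per octave, number of octaves, sample rate), are illustrative assumptions and not the configuration used in this work.

```python
import numpy as np


def cqt_bin_layout(f_min=32.70, bins_per_octave=24, n_octaves=9, sample_rate=44100):
    """Return center frequencies, bandwidths, window lengths, and Q for a CQT.

    All default values are assumptions chosen for illustration only.
    """
    n_bins = bins_per_octave * n_octaves
    k = np.arange(n_bins)
    # Center frequencies are geometrically spaced: each octave doubles the frequency.
    center_freqs = f_min * 2.0 ** (k / bins_per_octave)
    # Q is the same for every bin, hence "constant-Q".
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    # Bandwidth grows with frequency, so frequency resolution is finer at low frequencies.
    bandwidths = center_freqs / q
    # Analysis windows shrink with frequency, so time resolution is finer at high frequencies.
    window_lengths = np.ceil(q * sample_rate / center_freqs).astype(int)
    return center_freqs, bandwidths, window_lengths, q


def split_into_octaves(values, bins_per_octave):
    """Group per-bin values into consecutive octave sub-bands."""
    return [values[i:i + bins_per_octave] for i in range(0, len(values), bins_per_octave)]


center_freqs, bandwidths, window_lengths, q = cqt_bin_layout()
octaves = split_into_octaves(center_freqs, bins_per_octave=24)

for index, band in enumerate(octaves):
    print(f"octave {index}: {band[0]:.1f} Hz to {band[-1]:.1f} Hz, {len(band)} bins")
print(f"Q = {q:.2f}")
print(f"window length: {window_lengths[0]} samples (lowest bin), "
      f"{window_lengths[-1]} samples (highest bin)")
```

By contrast, an STFT uses one window length for every frequency bin, so its bandwidth per bin is the same at all frequencies. In practice a CQT spectrogram would be computed with an audio library rather than derived by hand; the sketch only shows how the bin layout leads to octave-wise sub-bands.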