Generative Adversarial Network (GAN)-based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis. These properties make it poorly suited to signals such as singing voices, which require dynamic attention across different frequency bands and time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both the CQT and the CWT have a dynamic TF resolution across frequency bands. Of the two, the CQT is better at modeling pitch information, whereas the CWT is better at modeling short-time transients. Experiments on both speech and singing voices confirm the effectiveness of the proposed discriminators. Moreover, the STFT-, CQT-, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
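To make the contrast between the STFT's linearly scaled center frequencies and the CQT's frequency-dependent resolution concrete, the following is a minimal sketch using the textbook definitions of the two transforms, not the configuration used in this work. The sampling rate, FFT size, minimum frequency, and bins per octave are illustrative assumptions, as are the helper function names.

```python
import numpy as np

def stft_center_freqs(sr, n_fft):
    """Linearly spaced STFT bin centers: f_k = k * sr / n_fft."""
    return np.arange(n_fft // 2 + 1) * sr / n_fft

def cqt_center_freqs(f_min, n_bins, bins_per_octave):
    """Geometrically spaced CQT bin centers: f_k = f_min * 2**(k / B)."""
    return f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)

def cqt_window_lengths(freqs, sr, bins_per_octave):
    """Per-bin analysis window length N_k = ceil(Q * sr / f_k),
    where Q = 1 / (2**(1 / B) - 1) is the same for every bin."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return np.ceil(q * sr / freqs).astype(int)

# Illustrative values only (assumptions, not taken from the paper).
sr = 24000              # sampling rate in Hz
n_fft = 1024            # STFT size
f_min = 32.70           # roughly the pitch C1, in Hz
bins_per_octave = 12    # one bin per semitone
n_bins = 96             # eight octaves

stft = stft_center_freqs(sr, n_fft)
cqt = cqt_center_freqs(f_min, n_bins, bins_per_octave)
win = cqt_window_lengths(cqt, sr, bins_per_octave)

# STFT: every bin is the same distance (in Hz) from its neighbor.
print("STFT bin spacing (Hz):", stft[1] - stft[0])
# CQT: every bin is the same ratio (one semitone here) from its neighbor.
print("CQT bin ratio:", cqt[1] / cqt[0])
# CQT: low bins use long windows, high bins use short windows.
print("CQT window length, lowest vs highest bin:", win[0], win[-1])
```

Under these assumptions, the STFT spends the same bandwidth on every bin, while the CQT aligns its bins with musical pitch and trades frequency resolution for time resolution as frequency increases, which is the property the abstract refers to as a dynamic TF resolution.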