Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

In speech synthesis, generative adversarial networks (GANs), which train a generator (speech synthesizer) and a discriminator in a min-max game, are widely used to improve speech quality. Recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) commonly use an ensemble of discriminators to scrutinize waveforms from multiple perspectives. Such ensembles allow synthesized speech to closely approach real speech; however, the model size and computation time grow with the number of discriminators. As an alternative, this study proposes a Wave-U-Net discriminator, a single but expressive discriminator with a Wave-U-Net architecture. Its distinguishing property is that it assesses a waveform sample by sample, at the same resolution as the input signal, while extracting multilevel features via an encoder and decoder with skip connections. This architecture provides the generator with sufficiently rich information for the synthesized speech to closely match real speech. In the experiments, the proposed ideas were applied to a representative neural vocoder (HiFi-GAN) and an end-to-end TTS system (VITS). The results demonstrate that the proposed models achieve comparable speech quality with a discriminator that is 2.31 times faster and 14.5 times more lightweight in HiFi-GAN, and 1.90 times faster and 9.62 times more lightweight in VITS. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/waveunetd/.
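The data flow described in the abstract (an encoder that downsamples, a decoder that upsamples, skip connections between matching levels, and one score per input sample) can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: the single-channel signals, the fixed hand-picked kernels, the factor-2 resampling, the additive skip merge, and all function names are assumptions made purely for illustration. A real discriminator would use learned multichannel convolutions.

```python
def conv1d(x, kernel):
    """Same-length 1-D convolution with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]

def leaky_relu(x, slope=0.1):
    return [v if v > 0 else slope * v for v in x]

def downsample(x):
    """Halve the temporal resolution by keeping every second sample."""
    return x[::2]

def upsample(x):
    """Double the temporal resolution by repeating each sample."""
    return [v for v in x for _ in range(2)]

def toy_wave_unet_discriminator(wave, enc_kernels, dec_kernels, out_kernel):
    """Return one score per input sample (hypothetical toy model)."""
    depth = len(enc_kernels)
    if len(dec_kernels) != depth:
        raise ValueError("encoder and decoder must have the same depth")
    if len(wave) % (2 ** depth) != 0:
        raise ValueError("input length must be divisible by 2**depth")

    skips = []
    h = list(wave)
    # Encoder: extract features, remember them, then reduce resolution.
    for kernel in enc_kernels:
        h = leaky_relu(conv1d(h, kernel))
        skips.append(h)
        h = downsample(h)
    # Decoder: restore resolution and merge the matching encoder features.
    for kernel in dec_kernels:
        h = upsample(h)
        skip = skips.pop()
        h = [a + b for a, b in zip(h, skip)]  # additive skip (assumption)
        h = leaky_relu(conv1d(h, kernel))
    # Final projection: a score for every sample of the input waveform.
    return conv1d(h, out_kernel)

# Usage: a 16-sample waveform and three encoder/decoder levels.
wave = [((i * 7) % 5 - 2) / 2.0 for i in range(16)]
enc_kernels = [[0.25, 0.5, 0.25]] * 3
dec_kernels = [[-0.5, 1.0, -0.5]] * 3
scores = toy_wave_unet_discriminator(wave, enc_kernels, dec_kernels, [0.0, 1.0, 0.0])
```

The point of the sketch is the shape bookkeeping: `scores` has the same length as `wave`, so the generator would receive feedback for every sample, whereas a conventional downsampling-only discriminator returns a coarser output.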
