While recent advances in text-to-speech (TTS) synthesis have yielded remarkable improvements in speech quality, research on lightweight and fast models remains limited. This paper introduces FLY-TTS, a fast, lightweight, and high-quality speech synthesis system based on VITS. Specifically, 1) we replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients, which are then passed through the inverse short-time Fourier transform (iSTFT) to synthesize waveforms; 2) to compress the model size, we introduce a grouped parameter-sharing mechanism into the text encoder and the flow-based model; 3) we further employ the large pre-trained WavLM model for adversarial training to improve synthesis quality. Experimental results show that our model achieves a real-time factor of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the baseline (0.1221), while compressing the parameter count by 1.6x. Objective and subjective evaluations indicate that the speech quality of FLY-TTS is comparable to that of the strong baseline.
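The reported speedup can be checked from the two real-time factors alone. The sketch below assumes the conventional definition of the real-time factor (synthesis time divided by the duration of the generated audio); the function names and the 10-second utterance are hypothetical and serve only as illustration, not as the evaluation code used for FLY-TTS.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(baseline_rtf: float, model_rtf: float) -> float:
    """How many times faster the model is than the baseline."""
    if model_rtf <= 0:
        raise ValueError("model_rtf must be positive")
    return baseline_rtf / model_rtf


# Real-time factors as reported in the abstract (Intel Core i9 CPU).
BASELINE_RTF = 0.1221
FLY_TTS_RTF = 0.0139

ratio = speedup(BASELINE_RTF, FLY_TTS_RTF)

# Hypothetical utterance length, chosen only to make the numbers concrete.
hypothetical_audio_seconds = 10.0
fly_seconds = FLY_TTS_RTF * hypothetical_audio_seconds
baseline_seconds = BASELINE_RTF * hypothetical_audio_seconds

print(f"speedup: {ratio:.1f}x")
print(f"FLY-TTS:  {fly_seconds:.3f} s to synthesize {hypothetical_audio_seconds:.0f} s of audio")
print(f"baseline: {baseline_seconds:.3f} s to synthesize {hypothetical_audio_seconds:.0f} s of audio")
```

An RTF below 1 means synthesis runs faster than playback, so both systems are real-time on this CPU; the ratio of the two factors gives the 8.8x figure.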