Although diffusion models have become a popular choice in text-to-speech due to their strong generative ability, the intrinsic complexity of sampling from them harms their efficiency. As an alternative, we propose VoiceFlow, an acoustic model that uses a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the generation of mel-spectrograms as an ordinary differential equation conditioned on text inputs, whose vector field is then estimated. The rectified flow technique then straightens the sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single-speaker and multi-speaker corpora show the superior synthesis quality of VoiceFlow compared to its diffusion counterpart. Ablation studies further verify the validity of the rectified flow technique in VoiceFlow.
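The ODE formulation and the benefit of a straight sampling trajectory can be illustrated with a minimal NumPy sketch. This is not the VoiceFlow implementation: the function names, the linear interpolation path, the fixed-step Euler solver, and the 80-bin spectrogram shape are all assumptions chosen for illustration, and text conditioning is omitted. In a real system a trained network would play the role of `vector_field`.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path above; a network would be trained to regress it."""
    return x1 - x0


def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)
    return x


# Toy stand-ins (hypothetical): Gaussian noise and a random "mel-spectrogram".
rng = np.random.default_rng(0)
noise = rng.standard_normal((80, 100))  # assumed shape: 80 mel bins x 100 frames
mel = rng.standard_normal((80, 100))


def ideal_field(x, t):
    """Idealized, perfectly straight field standing in for a trained network."""
    return target_velocity(noise, mel)


one_step = euler_sample(ideal_field, noise, num_steps=1)
print(np.allclose(one_step, mel))
```

When the field is exactly straight, as in this idealized case, the velocity is constant along the path, so a single Euler step already lands on the target. A learned field is only approximately straight, which is why a small number of steps rather than exactly one is typically still needed.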