Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transportation costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without significantly degrading sample quality, we introduce a tailored consistency distillation method for WaveFM. Experimental results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.
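For concreteness, the following is a minimal PyTorch sketch of a conditional flow-matching training step with a mel-conditioned prior, under the common rectified-flow linear-path formulation. The `mel_conditioned_prior` helper, the model signature, and the hop length are illustrative assumptions; the abstract does not specify WaveFM's actual reparameterization or prior construction.

```python
# Sketch of one conditional flow-matching training step for a vocoder,
# assuming a rectified-flow style linear interpolation path. The prior
# builder, model signature, and hyperparameters are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def mel_conditioned_prior(mel: torch.Tensor, hop_length: int = 256) -> torch.Tensor:
    """Hypothetical prior: Gaussian noise scaled by per-frame mel energy,
    upsampled to waveform length, so x0 starts closer to the target."""
    energy = mel.exp().mean(dim=1, keepdim=True)           # (B, 1, frames), assumes log-mel input
    scale = torch.repeat_interleave(energy, hop_length, dim=-1)  # (B, 1, T)
    return scale * torch.randn_like(scale)

def flow_matching_step(model, wav: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
    """Regress the constant velocity of the straight path from the
    mel-conditioned prior x0 to the target waveform x1."""
    x1 = wav                                               # (B, 1, T)
    x0 = mel_conditioned_prior(mel)                        # same shape as x1
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)     # per-sample time in [0, 1)
    xt = (1 - t) * x0 + t * x1                             # point on the linear path
    target_v = x1 - x0                                     # velocity of the straight path
    pred_v = model(xt, t.view(-1), mel)                    # assumed model signature
    return F.mse_loss(pred_v, target_v)
```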
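The auxiliary multi-resolution STFT loss mentioned above conventionally combines a spectral-convergence term and a log-magnitude term averaged over several FFT configurations, as popularized by Parallel WaveGAN. The sketch below shows that standard baseline formulation under assumed resolutions, not the paper's refined variant, whose details are not given in the abstract.

```python
# Sketch of a conventional multi-resolution STFT loss: spectral
# convergence plus log-magnitude L1, averaged over several FFT
# configurations. The resolutions chosen here are common defaults,
# assumed for illustration.
import torch

def stft_mag(x: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    """Magnitude spectrogram of a (B, T) waveform batch."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return spec.abs().clamp_min(1e-7)  # floor avoids log(0)

def multi_resolution_stft_loss(pred: torch.Tensor, target: torch.Tensor,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    loss = 0.0
    for n_fft, hop in resolutions:
        p = stft_mag(pred, n_fft, hop)
        t = stft_mag(target, n_fft, hop)
        sc = torch.norm(t - p, p="fro") / torch.norm(t, p="fro")  # spectral convergence
        mag = (t.log() - p.log()).abs().mean()                    # log-magnitude L1
        loss = loss + sc + mag
    return loss / len(resolutions)
```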