Conditional variational autoencoder (cVAE)-based singing voice synthesis achieves efficient inference and strong audio quality by learning a score-conditioned prior and a recording-conditioned posterior latent space. However, because synthesis draws from the prior while training uses posterior latents inferred from real recordings, any residual mismatch between the two distributions degrades fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in latent space to learn a continuous vector field that transports prior latents toward posterior latents along an optimal-transport-inspired path. At inference time, the learned latent flow refines a prior sample by solving an ordinary differential equation (ODE) before waveform generation, improving expressiveness while preserving the efficiency of parallel decoding. Experiments on Korean and Chinese singing datasets demonstrate consistent improvements over strong baselines, including lower mel-cepstral distortion and fundamental-frequency error, as well as higher perceptual scores on the Korean dataset. Code, pretrained checkpoints, and audio demos are available at https://github.com/alsgur9368/FM-Singer.
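The latent refinement described above can be sketched in a few lines. This is a minimal, hypothetical illustration of conditional flow matching with a linear (optimal-transport-inspired) interpolation path and Euler ODE integration; the `VelocityField` network, its size, and the step count are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch of latent-space conditional flow matching (CFM).
# Assumptions: a toy MLP velocity field, a linear interpolation path
# z_t = (1 - t) z_prior + t z_post, and a fixed-step Euler ODE solver.
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Predicts the velocity v_theta(z_t, t) along the prior->posterior path."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on time by concatenating t to the latent.
        return self.net(torch.cat([z_t, t], dim=-1))


def cfm_loss(model: VelocityField,
             z_prior: torch.Tensor,
             z_post: torch.Tensor) -> torch.Tensor:
    """CFM regression target: along the linear path, the true velocity
    is the constant displacement z_post - z_prior."""
    t = torch.rand(z_prior.shape[0], 1)
    z_t = (1.0 - t) * z_prior + t * z_post
    target = z_post - z_prior
    return ((model(z_t, t) - target) ** 2).mean()


@torch.no_grad()
def refine(model: VelocityField, z_prior: torch.Tensor,
           steps: int = 8) -> torch.Tensor:
    """Euler integration of dz/dt = v_theta(z, t) from t=0 (prior sample)
    to t=1, yielding a posterior-like latent for the waveform decoder."""
    z = z_prior.clone()
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((z.shape[0], 1), i * dt)
        z = z + dt * model(z, t)
    return z


# Toy usage with random latents standing in for the cVAE's prior/posterior.
dim = 16
model = VelocityField(dim)
z_prior = torch.randn(4, dim)
z_post = torch.randn(4, dim)
loss = cfm_loss(model, z_prior, z_post)
refined = refine(model, z_prior)
```

In practice the velocity field would also be conditioned on the musical score, and the refined latent (not the raw prior sample) is what gets decoded to a waveform, so the fixed, small number of ODE steps keeps inference parallel and fast.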