In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
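The two ingredients described above, a flow-matching objective over complex STFT coefficients and vocoder-free waveform reconstruction via the iSTFT, can be sketched in a few lines. This is a minimal NumPy/SciPy illustration under stated assumptions: a linear probability path from Gaussian noise to data, a 1024-sample analysis window, and a stand-in signal in place of real audio. All names and shapes here are hypothetical and do not reproduce the paper's actual model.

```python
# Minimal sketch (assumptions: linear probability path, hann window, 50% overlap).
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 48_000
x_hr = rng.standard_normal(fs)  # stand-in for a 1 s high-resolution clip

# Complex-valued STFT coefficients are the generative target.
_, _, X1 = stft(x_hr, fs=fs, nperseg=1024)

# Conditional flow matching: interpolate noise -> data, regress the velocity.
X0 = (rng.standard_normal(X1.shape) + 1j * rng.standard_normal(X1.shape)) / np.sqrt(2)
t = rng.uniform()
Xt = (1.0 - t) * X0 + t * X1   # point on the linear path; a model would take
                               # (Xt, t, low-res condition) as input
v_target = X1 - X0             # velocity the network is trained to predict

# Here we fake a perfectly trained model to show the training loss.
v_pred = v_target
fm_loss = np.mean(np.abs(v_pred - v_target) ** 2)

# After sampling, the waveform comes directly from the iSTFT: no vocoder stage.
_, x_rec = istft(X1, fs=fs, nperseg=1024)
```

Because the hann window with 50% overlap satisfies the COLA constraint, the iSTFT round trip recovers the waveform to numerical precision, which is what makes the vocoder-free design possible.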