Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues due to their operation at the individual sample point level and the need for numerous sampling steps. In this study, we introduce RFWave, a cutting-edge multi-band Rectified Flow approach designed to reconstruct high-fidelity audio waveforms from Mel-spectrograms or discrete tokens. RFWave uniquely generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to boost efficiency. Leveraging Rectified Flow, which targets a flat transport trajectory, RFWave achieves reconstruction with just 10 sampling steps. Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers vastly superior computational efficiency, enabling audio generation at speeds up to 97 times faster than real-time on a GPU. An online demonstration is available at: https://rfwave-demo.github.io/rfwave/.
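The claim that flat transport trajectories permit only ~10 sampling steps can be illustrated with a toy sketch of Rectified Flow sampling. This is a hypothetical, self-contained example, not the RFWave implementation: it uses an oracle straight-line velocity in place of the learned neural velocity field, and a plain NumPy vector in place of a complex-spectrogram frame.

```python
import numpy as np

# Toy sketch of Rectified Flow sampling (illustrative, not the RFWave model).
# Rectified Flow learns a velocity field v(x_t, t) along (near-)straight
# transport paths x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1.
# Because the trajectory is flat, a few Euler steps of the ODE dx/dt = v
# suffice; here the velocity is an oracle, so 10 steps land on the target.

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)   # "noise" sample
x1 = rng.standard_normal(8)   # "data" sample (stands in for a spectrogram frame)

def velocity(x_t, t):
    # Oracle constant velocity along the straight path.  In RFWave this is
    # a neural network conditioned on a Mel-spectrogram or discrete tokens.
    return x1 - x0

num_steps = 10
x = x0.copy()
for i in range(num_steps):
    t = i / num_steps
    x = x + velocity(x, t) / num_steps  # Euler step with dt = 1 / num_steps

print(np.max(np.abs(x - x1)))  # near zero: flat paths need very few steps
```

With a learned (approximately straight) velocity field the integration error is no longer exactly zero, but it shrinks quickly with step count, which is the efficiency argument the abstract makes.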