Diffusion-based audio and music generation models commonly perform generation by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16--24 kHz), which limits their usefulness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth extension, and upmixes to stereophonic audio. Compared to past work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach via objective metrics and subjective listening tests and find it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed than past work. Sound examples are available at \url{https://MusicHiFi.github.io/web/}.
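To make the downmix-compatibility constraint concrete, the following is one natural mid/side formulation under which monophonic content is preserved by construction; it is an illustrative sketch rather than a verbatim statement of the upmixer module. Given the mono (mid) input $m$ and a generator-predicted side signal $s$, the stereo channels can be formed as
\begin{equation*}
  x_L = m + s, \qquad x_R = m - s, \qquad \tfrac{1}{2}\left(x_L + x_R\right) = m,
\end{equation*}
so downmixing the generated stereo back to mono recovers the input exactly, which is the sense in which the mono content is guaranteed to be preserved in the output.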