Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications. Previous methods are limited both in the scope of audio types they can handle (e.g., music, speech) and in the specific bandwidth settings they support (e.g., 4kHz to 8kHz). In this paper, we introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types, including sound effects, music, and speech. Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2kHz to 16kHz to a high-resolution audio signal at 24kHz bandwidth with a sampling rate of 48kHz. Extensive objective evaluation on various audio super-resolution benchmarks demonstrates the strong results achieved by the proposed model. In addition, our subjective evaluation shows that AudioSR can act as a plug-and-play module to enhance the generation quality of a wide range of audio generative models, including AudioLDM, Fastspeech2, and MusicGen. Our code and demo are available at https://audioldm.github.io/audiosr.
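The bandwidth figures above follow from the Nyquist relation: a signal sampled at 48kHz can represent frequencies up to 24kHz. The sketch below illustrates that relation and shows one simple way to simulate a band-limited input of the kind a super-resolution model would receive. It is not the AudioSR implementation or its API. The function names, the test tones, the 4kHz cutoff, and the FFT-based brick-wall filter are all assumptions made for illustration.

```python
import numpy as np


def nyquist_bandwidth_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at a given sampling rate."""
    return sample_rate_hz / 2.0


def simulate_low_bandwidth(signal: np.ndarray, sample_rate_hz: float,
                           cutoff_hz: float) -> np.ndarray:
    """Zero every frequency bin above cutoff_hz (hypothetical brick-wall low-pass)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


def energy_fraction_above(signal: np.ndarray, sample_rate_hz: float,
                          cutoff_hz: float) -> float:
    """Fraction of spectral energy located above cutoff_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    power = np.abs(spectrum) ** 2
    return float(power[freqs > cutoff_hz].sum() / power.sum())


SAMPLE_RATE_HZ = 48_000  # output sampling rate stated in the abstract
CUTOFF_HZ = 4_000        # assumed input bandwidth, inside the 2kHz to 16kHz range

# One second of audio containing a 1kHz tone and a 12kHz tone (illustrative only).
t = np.arange(SAMPLE_RATE_HZ) / SAMPLE_RATE_HZ
full_band = np.sin(2 * np.pi * 1_000 * t) + np.sin(2 * np.pi * 12_000 * t)

# The band-limited version keeps the 1kHz tone and loses the 12kHz tone.
low_band = simulate_low_bandwidth(full_band, SAMPLE_RATE_HZ, CUTOFF_HZ)

print(nyquist_bandwidth_hz(SAMPLE_RATE_HZ))
print(energy_fraction_above(full_band, SAMPLE_RATE_HZ, CUTOFF_HZ))
print(energy_fraction_above(low_band, SAMPLE_RATE_HZ, CUTOFF_HZ))
```

In this framing, a super-resolution model takes something like `low_band` as input and has to generate plausible content for the emptied region between the cutoff and 24kHz.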