Audio super-resolution (SR), i.e., upsampling a low-resolution (LR) waveform to its high-resolution (HR) version, has recently been explored with diffusion and bridge models, but existing methods often suffer from sub-optimal upsampling quality because their generation prior is uninformative. Towards high-quality audio super-resolution, we present a new system built on latent bridge models (LBMs): we compress the audio waveform into a continuous latent space and design an LBM that enables a latent-to-latent generation process, which naturally matches the LR-to-HR upsampling process and thereby fully exploits the instructive prior information contained in the LR waveform. To further improve training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, in which the prior and target frequencies are taken as model inputs, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, making the first attempt at audio upsampling beyond 48 kHz and enabling a seamless cascaded SR process that provides higher flexibility for audio post-production. Comprehensive experimental results on the VCTK, ESC-50, and Song-Describer benchmark datasets and two internal test sets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, and set the first record for any-to-192 kHz audio SR. Demo at https://AudioLBM.github.io/.
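To make the frequency-aware, any-to-any training idea more concrete, the following is a minimal sketch of how (prior, target) frequency pairs might be drawn during training. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_frequency_pair`, the 8 kHz lower bound, the 1 kHz grid, and the uniform sampling scheme are all hypothetical choices that the abstract does not specify.

```python
import random


def sample_frequency_pair(max_rate_hz, min_rate_hz=8000, step_hz=1000, rng=random):
    """Draw a (prior_rate, target_rate) pair for any-to-any SR training.

    Hypothetical helper: the grid bounds and uniform sampling are
    assumptions for illustration, not values taken from the paper.
    The returned pair always satisfies
    min_rate_hz <= prior_rate < target_rate <= max_rate_hz.
    """
    if max_rate_hz - min_rate_hz < step_hz:
        raise ValueError("need at least two grid points between min and max rate")

    # Candidate sampling rates on a regular grid (assumed, not from the source).
    grid = list(range(min_rate_hz, max_rate_hz + 1, step_hz))

    # Pick the prior strictly below the target so every pair is an upsampling task.
    prior_idx = rng.randrange(0, len(grid) - 1)
    target_idx = rng.randrange(prior_idx + 1, len(grid))
    return grid[prior_idx], grid[target_idx]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        prior, target = sample_frequency_pair(48000, rng=rng)
        # Both values would be passed to the model as conditioning inputs.
        print(f"prior={prior} Hz -> target={target} Hz")
```

Under these assumptions, a single HR recording at 48 kHz can supply many distinct upsampling tasks, which is one plausible way to make better use of a limited pool of HR samples.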