Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks, by simulating common degradation types with nineteen degradation functions belonging to five enhancement groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions, guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster's enhanced outputs over those of other baselines.
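To illustrate the flow-matching training paradigm mentioned above, the following is a minimal sketch of a conditional flow-matching objective in PyTorch. It is not SonicMaster's actual architecture or training code: the network `VelocityModel`, the latent dimensions, and the names `clean_latent`, `degraded_latent`, and `text_embedding` are hypothetical placeholders, and the interpolation/velocity-regression form shown is the standard rectified-flow-style formulation rather than the paper's exact recipe.

```python
# Sketch of a conditional flow-matching training step (hypothetical, not the
# paper's implementation). Noise is interpolated toward the clean target, and
# a network predicts the constant velocity conditioned on the degraded input
# and a text embedding.
import torch
import torch.nn as nn

class VelocityModel(nn.Module):
    """Toy stand-in for the conditional velocity network."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, degraded, text_emb, t):
        # Concatenate noisy state, degraded-audio conditioning, text
        # embedding, and time step; predict the velocity field.
        h = torch.cat([x_t, degraded, text_emb, t[:, None]], dim=-1)
        return self.net(h)

def flow_matching_loss(model, clean_latent, degraded_latent, text_embedding):
    """Interpolate between Gaussian noise (t=0) and the clean target (t=1),
    then regress the model output onto the velocity (target - noise)."""
    noise = torch.randn_like(clean_latent)
    t = torch.rand(clean_latent.shape[0], device=clean_latent.device)
    x_t = (1 - t)[:, None] * noise + t[:, None] * clean_latent
    target_velocity = clean_latent - noise
    pred_velocity = model(x_t, degraded_latent, text_embedding, t)
    return nn.functional.mse_loss(pred_velocity, target_velocity)

# Usage with random tensors standing in for audio latents and text embeddings.
model = VelocityModel(dim=64, cond_dim=32)
clean = torch.randn(8, 64)
degraded = torch.randn(8, 64)
text = torch.randn(8, 32)
loss = flow_matching_loss(model, clean, degraded, text)
loss.backward()
```

At inference, the same velocity network would be integrated from noise toward the enhanced output (e.g., with an Euler ODE solver), conditioned on the degraded recording and the user's text instruction.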