Vocal recordings on consumer devices commonly suffer from multiple concurrent degradations: noise, reverberation, band-limiting, and clipping. We present Smule Renaissance Small (SRS), a compact single-stage model that performs end-to-end vocal restoration directly in the complex STFT domain. By incorporating phase-aware losses, SRS can use large analysis windows for improved frequency resolution while achieving 10.5x real-time inference on an iPhone 12 CPU at 48 kHz. On the DNS 5 Challenge blind set, despite not being trained on speech, SRS outperforms a strong GAN baseline and closely matches a computationally expensive flow-matching system. To enable evaluation under realistic multi-degradation scenarios, we introduce the Extreme Degradation Bench (EDB): 87 singing and speech recordings captured under severe acoustic conditions. On EDB, SRS surpasses all open-source baselines on singing and matches commercial systems, while remaining competitive on speech despite having no speech-specific training. We release both SRS and EDB under the MIT License.
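To make the two quantitative claims concrete, the following is a minimal sketch of the underlying arithmetic: the spacing between adjacent STFT bins is the sample rate divided by the window length, and a real-time factor of 10.5x means audio is processed in 1/10.5 of its duration. The window lengths below are illustrative assumptions, not the configuration used by SRS, and the function names are hypothetical.

```python
def bin_spacing_hz(sample_rate_hz: float, window_length: int) -> float:
    """Spacing between adjacent STFT frequency bins, in Hz."""
    return sample_rate_hz / window_length


def processing_time_s(audio_duration_s: float, real_time_factor: float) -> float:
    """Wall-clock time needed to process audio at a given real-time factor."""
    return audio_duration_s / real_time_factor


SAMPLE_RATE_HZ = 48_000  # sample rate stated in the abstract

# Hypothetical window lengths (assumed for illustration only).
for window_length in (512, 2048, 4096):
    spacing = bin_spacing_hz(SAMPLE_RATE_HZ, window_length)
    print(f"window={window_length:5d} samples -> bin spacing {spacing:.2f} Hz")

# 10.5x real-time, as reported for the iPhone 12 CPU.
seconds = processing_time_s(60.0, 10.5)
print(f"60 s of audio at 10.5x real-time -> about {seconds:.2f} s of compute")
```

Longer windows give finer bin spacing at the cost of coarser time resolution, which is the trade-off the abstract says phase-aware losses help manage.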