In this work, we demonstrate the integration of a score-matching diffusion model into a deterministic architecture for time-domain musical source extraction, resulting in enhanced audio quality. To address the typically slow iterative sampling of diffusion models, we apply consistency distillation, reducing sampling to a single step while matching the quality of the full diffusion model, and surpassing it with two or more steps. Trained on the Slakh2100 dataset for four instruments (bass, drums, guitar, and piano), our model shows significant improvements across objective metrics compared to baseline methods. Sound examples are available at https://consistency-separation.github.io/.
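The single-step and multistep sampling described above can be sketched as follows. This is a minimal illustration of multistep consistency sampling conditioned on a mixture signal, not the paper's actual implementation; the function `f`, the conditioning scheme, and the noise schedule `sigmas` are all illustrative assumptions.

```python
import numpy as np

def consistency_sample(f, mixture, sigmas, rng):
    """Multistep consistency sampling, conditioned on the input mixture.

    f(x, sigma, mixture) -> estimate of the clean sources (the distilled
    consistency model; here an arbitrary callable, purely illustrative).
    sigmas is a descending sequence of noise levels; a single entry gives
    one-step sampling, additional entries refine the estimate further.
    """
    # Start from the mixture perturbed to the highest noise level.
    x = mixture + sigmas[0] * rng.standard_normal(mixture.shape)
    x0 = f(x, sigmas[0], mixture)
    for sigma in sigmas[1:]:
        # Re-noise the current estimate to the next (lower) noise level,
        # then map it back to a clean estimate in one model call.
        x = x0 + sigma * rng.standard_normal(x0.shape)
        x0 = f(x, sigma, mixture)
    return x0
```

With `sigmas` of length one, the loop body never runs and this reduces to the single-step sampler; each extra noise level adds one more model evaluation.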