Separating the individual elements of a musical mixture is an essential process for music analysis and practice. While this task is generally addressed with neural networks optimized to mask or transform the time-frequency representation of a mixture and extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this challenging problem. In this work, we explore singing voice separation from real music recordings using a diffusion model trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and, when trained with supplementary data, achieves objective scores competitive with non-generative baselines. The iterative nature of diffusion sampling lets the user control the quality-efficiency trade-off and refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of its user-configurable parameters.
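To make the conditional-sampling idea concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation) of DDPM-style ancestral sampling in which a toy denoiser is conditioned on the mixture spectrogram, and the number of sampling steps controls the quality-efficiency trade-off. All names (`ToyDenoiser`, `sample_vocals`), the network architecture, and the noise schedule are assumptions made for illustration only.

```python
import torch

# Hypothetical denoiser: predicts the noise in the current vocal estimate,
# conditioned on the mixture spectrogram (names and architecture are illustrative).
class ToyDenoiser(torch.nn.Module):
    def __init__(self, in_channels=2):
        super().__init__()
        self.net = torch.nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, x_t, mixture, t):
        # Concatenate the noisy vocal estimate with the mixture as conditioning.
        return self.net(torch.cat([x_t, mixture], dim=1))

@torch.no_grad()
def sample_vocals(model, mixture, num_steps=50):
    """DDPM-style ancestral sampling; num_steps trades quality for speed."""
    betas = torch.linspace(1e-4, 0.02, num_steps)        # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(mixture)                         # start from pure noise
    for t in reversed(range(num_steps)):
        eps = model(x, mixture, t)                         # predicted noise, given the mixture
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])  # standard DDPM posterior-mean update
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Usage with a dummy 128x256 mixture spectrogram: fewer steps run faster but yield coarser vocals.
mixture = torch.randn(1, 1, 128, 256)
vocals = sample_vocals(ToyDenoiser(), mixture, num_steps=25)
print(vocals.shape)
```

Because the estimate is produced iteratively, the same trained model can be rerun with more steps, or restarted from a partially denoised state, to refine an unsatisfactory output.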