Music source separation (MSS) is the task of isolating individual sound sources, or stems, from a mixed audio signal. This paper presents an ensemble approach to MSS that combines several state-of-the-art architectures to improve separation of the traditional Vocals, Drums, and Bass (VDB) stems, and extends to second-level hierarchical separation of sub-stems such as kick, snare, lead vocals, and background vocals. Our method addresses the limitations of relying on a single model by utilising the complementary strengths of different models, which leads to more balanced results across stems. To select the model for each stem, we used the harmonic mean of Signal-to-Noise Ratio (SNR) and Signal-to-Distortion Ratio (SDR), so that a high value on one metric cannot mask a low value on the other and both metrics contribute to the choice. Beyond consistently high performance on the VDB stems, our exploration of second-level hierarchical separation shows how factors such as genre and instrumentation influence model performance. Although the second-level results leave room for improvement, the ability to isolate sub-stems marks a significant advancement. Our findings pave the way for further research in MSS, particularly in extending model capabilities beyond VDB and improving separation of niche stems such as guitar and piano.
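The stem selection rule described in the abstract, picking a model per stem by the harmonic mean of SNR and SDR, could be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the model names, and the scores are hypothetical, and the handling of non-positive scores (mapped to 0, since the harmonic mean is only defined for positive values) is an assumption the abstract does not specify.

```python
def harmonic_mean(snr_db: float, sdr_db: float) -> float:
    """Harmonic mean of two scores in dB.

    Assumption: non-positive inputs yield 0.0, because the harmonic
    mean is only well defined for positive values.
    """
    if snr_db <= 0 or sdr_db <= 0:
        return 0.0
    return 2.0 * snr_db * sdr_db / (snr_db + sdr_db)


def select_model_per_stem(scores: dict) -> dict:
    """Pick, for each stem, the model with the highest harmonic mean.

    `scores` maps stem -> {model name -> (SNR, SDR)}.
    """
    return {
        stem: max(models, key=lambda name: harmonic_mean(*models[name]))
        for stem, models in scores.items()
    }


# Hypothetical scores for illustration only; these are not reported results.
example_scores = {
    "vocals": {"model_a": (10.0, 2.0), "model_b": (6.0, 5.0)},
    "drums": {"model_a": (7.0, 7.0), "model_b": (9.0, 4.0)},
}

selected = select_model_per_stem(example_scores)
print(selected)
```

In the hypothetical vocals case, `model_a` has the higher arithmetic mean (6.0 against 5.5), but its low SDR pulls its harmonic mean down, so `model_b` is selected. This is the balancing behaviour the harmonic mean is chosen for.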