Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer). BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE). The BS-RoFormer system trained on MUSDB18HQ and 500 extra songs ranked first in the MSS track of the Sound Demixing Challenge (SDX23). Benchmarking a smaller version of BS-RoFormer on MUSDB18HQ, we achieve state-of-the-art results without extra training data, with an average SDR of 9.80 dB.
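The abstract does not spell out how RoPE is applied inside the Transformers, so the following is only a minimal NumPy sketch of the standard rotary position embedding, not the authors' implementation. The function name `rope`, the `(seq_len, dim)` input layout, the interleaved channel pairing, and the base of 10000 are all assumptions made for illustration.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply a rotary position embedding to x of shape (seq_len, dim).

    Hypothetical helper: each consecutive channel pair (2i, 2i+1) at
    position m is rotated by the angle m * base**(-2i/dim).
    """
    x = np.asarray(x, dtype=np.float64)
    seq_len, dim = x.shape
    if dim % 2 != 0:
        raise ValueError("dim must be even so channels can be paired")

    # One rotation frequency per channel pair (assumed base of 10000).
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)           # (dim/2,)
    # Rotation angle for every (position, pair) combination.
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # 2-D rotation applied independently to each channel pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

In an attention layer, a function like this would typically be applied to the queries and keys before their dot product. Because each pair is rotated by an angle proportional to its position, the resulting score between positions m and n depends only on the offset n - m, which is the property that makes the embedding suitable for both the time axis and the subband axis of a band-split model.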