Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, and drums. Deep learning approaches based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have recently been applied to this task, but the improvement remains limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (BS-RoFormer). BS-RoFormer uses a band-split module to project the input complex spectrogram into subband-level representations, and then applies a stack of hierarchical Transformers to model the inner-band and inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use Rotary Position Embedding (RoPE). The BS-RoFormer system trained on MUSDB18HQ and 500 extra songs ranked first in the MSS track of the Sound Demixing Challenge (SDX23). Benchmarking a smaller version of BS-RoFormer on MUSDB18HQ, we achieve state-of-the-art results without extra training data, with an average SDR of 9.80 dB.
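The two building blocks named above, the band-split projection and RoPE, can be illustrated with a minimal NumPy sketch. This is not the BS-RoFormer implementation: the band edges, the feature size `dim`, the random weights standing in for learned per-band linear layers, and the frequency base of 10000 are all assumptions chosen for illustration, and `band_split` and `rope` are hypothetical helper names.

```python
import numpy as np

def band_split(spec, band_edges, dim, rng):
    """Project each subband of a complex spectrogram to a `dim`-sized feature.

    spec: complex array of shape (T, F). band_edges: increasing bin indices
    starting at 0 and ending at F (hypothetical values, not the paper's).
    Returns a real array of shape (T, K, dim) for K bands.
    """
    feats = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = spec[:, lo:hi]
        # Stack real and imaginary parts as real features: (T, 2 * width).
        x = np.concatenate([band.real, band.imag], axis=-1)
        # Random weights stand in for a learned per-band linear layer.
        w = rng.standard_normal((x.shape[-1], dim)) / np.sqrt(x.shape[-1])
        feats.append(x @ w)
    return np.stack(feats, axis=1)

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (..., L, D), D even.

    Each consecutive feature pair is rotated by an angle proportional to
    its position, so dot products depend only on relative position.
    """
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # (D/2,)
    angles = positions[:, None] * inv_freq[None, :]    # (L, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
# Toy complex spectrogram: 8 frames, 16 frequency bins.
spec = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))
feats = band_split(spec, [0, 2, 6, 16], dim=4, rng=rng)  # (T, K, D)
per_band = feats.transpose(1, 0, 2)                      # (K, T, D)
rotated = rope(per_band, np.arange(8.0))                 # rotate along time
```

Here the rotation is applied along the time axis within each band. Applying the same function along the band axis would give the inter-band counterpart. The uneven band widths in the usage lines only show that each band can have its own input size while sharing one output size.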