Recently, multi-band spectrogram-based approaches such as Band-Split RNN (BSRNN) have demonstrated promising results for music source separation. In our recent work, we introduced the BS-RoFormer model, which inherits the band-split scheme of BSRNN at the front-end and then uses a hierarchical Transformer with Rotary Position Embedding (RoPE) to model the inner-band and inter-band sequences for multi-band mask estimation. This model has achieved state-of-the-art performance, but its band-split scheme is defined empirically, without analytical support from the literature. In this paper, we propose Mel-RoFormer, which adopts the Mel-band scheme that maps the frequency bins into overlapping subbands according to the mel scale. In contrast, the band-split mapping in BSRNN and BS-RoFormer is non-overlapping and designed based on heuristics. Using the MUSDB18HQ dataset for experiments, we demonstrate that Mel-RoFormer outperforms BS-RoFormer in the separation tasks of vocals, drums, and other stems.
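To make the idea of mapping frequency bins into overlapping mel-scale subbands concrete, the following is a minimal sketch and not the implementation used in this work. It assumes the HTK-style mel formula, and it assumes that each band spans the support of a triangular mel filter (from one mel-spaced edge to the edge two steps later), so that neighbouring bands share bins. The function names and the default values (`n_fft=2048`, `sample_rate=44100`, `n_bands=60`) are hypothetical choices for illustration only.

```python
import numpy as np


def hz_to_mel(f_hz):
    # HTK-style mel formula (an assumption; other mel variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(m):
    # Inverse of the HTK-style mel formula above.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_band_bins(n_fft=2048, sample_rate=44100, n_bands=60):
    """Return, for each band, the indices of the STFT bins it contains."""
    n_bins = n_fft // 2 + 1
    nyquist = sample_rate / 2.0
    bin_freqs = np.linspace(0.0, nyquist, n_bins)

    # n_bands + 2 edges, equally spaced on the mel scale from 0 Hz to Nyquist.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_bands + 2))
    # Pin the end points so rounding cannot leave the first or last bin out.
    edges[0], edges[-1] = 0.0, nyquist

    bands = []
    for i in range(n_bands):
        # Band i covers edges[i]..edges[i + 2], so it overlaps band i + 1
        # on the interval edges[i + 1]..edges[i + 2].
        lo, hi = edges[i], edges[i + 2]
        bands.append(np.flatnonzero((bin_freqs >= lo) & (bin_freqs <= hi)))
    return bands


bands = mel_band_bins()

# Count how many bands each frequency bin belongs to.
coverage = np.zeros(2048 // 2 + 1, dtype=int)
for idx in bands:
    coverage[idx] += 1
```

Under these assumptions every bin belongs to at least one band and most bins belong to two, whereas a non-overlapping band-split scheme assigns each bin to exactly one band. Low-frequency bands are narrow and high-frequency bands are wide, following the mel scale.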