Separating a mixture of multiple singing voices into individual voices is a rarely studied area of music source separation research, and the absence of a benchmark dataset has hindered its progress. In this paper, we present an evaluation dataset and provide baseline studies for multiple singing voice separation. First, we introduce MedleyVox, an evaluation dataset for multiple singing voice separation. We specify the problem definition in this dataset by dividing it into four categories: i) unison, ii) duet, iii) main vs. rest, and iv) N-singing separation. Second, to overcome the lack of multi-singing datasets for training, we present a strategy for constructing multiple singing mixtures from various single-singing datasets. Third, we propose the improved super-resolution network (iSRNet), which greatly enhances the initial estimates of separation networks. Trained jointly with Conv-TasNet using the multi-singing mixture construction strategy, the proposed iSRNet achieved performance comparable to ideal time-frequency masks on the duet and unison subsets of MedleyVox. Audio samples, the dataset, and code are available on our website (https://github.com/jeonchangbin49/MedleyVox).
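The abstract does not detail the mixture construction strategy, so the following is only a minimal sketch of the general idea of building multi-singing training mixtures from single-singing clips. The function name `make_mixture`, the random gain range, and the peak rescaling are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def make_mixture(voices, n_sources=2, gain_db_range=(-5.0, 5.0), rng=None):
    """Sum n_sources randomly chosen single-voice clips into one mixture.

    Hypothetical helper: `voices` is a list of 1-D float arrays that are
    assumed to share one sample rate. Returns (mixture, sources), where
    sources has shape (n_sources, length).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pick distinct clips so a voice is never mixed with itself.
    picks = rng.choice(len(voices), size=n_sources, replace=False)
    # Crop every picked clip to the shortest one (an assumption).
    length = min(len(voices[i]) for i in picks)
    sources = []
    for i in picks:
        # Random gain in dB, converted to a linear scale (an assumption).
        gain = 10.0 ** (rng.uniform(*gain_db_range) / 20.0)
        sources.append(gain * np.asarray(voices[i][:length], dtype=np.float64))
    sources = np.stack(sources)
    mixture = sources.sum(axis=0)
    # Rescale mixture and targets together if the sum would clip.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        sources = sources / peak
        mixture = mixture / peak
    return mixture, sources
```

A separation network would then be trained to recover `sources` from `mixture`. Because the mixture and its targets are rescaled by the same factor, the mixture always remains the exact sum of the targets.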