Singing voice separation and vocal pitch estimation are pivotal tasks in music information retrieval. Existing methods for simultaneously extracting clean vocals and vocal pitches fall into two categories: pipeline methods and naive joint learning methods. Both have limitations. Pipeline methods train a model for each task independently, resulting in a mismatch between the data distributions at training and test time. Naive joint learning methods simply add the losses of the two tasks, which can leave the distinct objectives of the tasks misaligned. To solve these problems, we propose a Deep Joint Cascade Model (DJCM) for singing voice separation and vocal pitch estimation. DJCM employs a novel joint cascade model structure to train both tasks concurrently, and uses task-specific weights to align their different objectives. Experimental results show that DJCM achieves state-of-the-art performance on both tasks, improving Signal-to-Distortion Ratio (SDR) by 0.45 for singing voice separation and Overall Accuracy (OA) by 2.86% for vocal pitch estimation. Furthermore, extensive ablation studies validate the effectiveness of each design choice in the proposed model. The code of DJCM is available at https://github.com/Dream-High/DJCM.
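As a rough illustration of the two ideas named in the abstract (a cascade in which the separated vocals feed the pitch estimator, and a joint objective with task-specific weights instead of a plain sum), the following minimal sketch may help. It is not the DJCM implementation: the function names, the L1 placeholder loss, the toy "models", and the weight values are all assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Signal = List[float]


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error; a placeholder for whatever per-task loss is used."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cascade_forward(
    mixture: Signal,
    separator: Callable[[Signal], Signal],
    pitch_estimator: Callable[[Signal], Signal],
) -> Tuple[Signal, Signal]:
    """Cascade: the pitch estimator consumes the separator's output,
    so it sees the same kind of input during training and testing."""
    vocals = separator(mixture)
    pitch = pitch_estimator(vocals)
    return vocals, pitch


def joint_loss(
    sep_loss: float,
    pitch_loss: float,
    w_sep: float = 1.0,    # hypothetical weight for the separation task
    w_pitch: float = 1.0,  # hypothetical weight for the pitch task
) -> float:
    """Weighted sum of the two task losses.
    With w_sep == w_pitch == 1.0 this reduces to the naive joint loss."""
    return w_sep * sep_loss + w_pitch * pitch_loss


# Toy stand-ins for the two sub-networks (assumptions, not real models).
def toy_separator(x: Signal) -> Signal:
    return [0.5 * v for v in x]


def toy_pitch_estimator(v: Signal) -> Signal:
    return [sum(v)]


vocals, pitch = cascade_forward([1.0, 2.0], toy_separator, toy_pitch_estimator)
loss = joint_loss(
    l1_loss(vocals, [0.5, 1.0]),
    l1_loss(pitch, [2.0]),
    w_sep=0.5,
    w_pitch=0.25,
)
```

In a real training loop both sub-networks would be optimized together through this single scalar, and the weights would be chosen so that neither task's loss dominates the other.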