Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures, which creates a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a differentiable parametric source-filter model. A neural network is trained to reconstruct the observed mixture as a sum of the sources by estimating the source models' parameters given their fundamental frequencies. At test time, soft masks are obtained from the synthesized source signals. The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms both learning-free methods based on nonnegative matrix factorization and a supervised deep learning baseline. Integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency: the proposed approach achieves good separation quality even when trained on less than three minutes of audio. This work makes powerful deep-learning-based separation usable in scenarios where training data with ground truth is expensive or nonexistent.
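The test-time masking step can be illustrated with a minimal sketch. It assumes a ratio-style soft mask computed from the magnitude spectrograms of the synthesized sources; the exact mask definition is not given in the text above, so the formula, the function names `soft_masks` and `apply_masks`, and the random stand-in data are all illustrative assumptions rather than the proposed method's implementation.

```python
import numpy as np


def soft_masks(source_mags, eps=1e-8):
    """Ratio-style soft masks (an assumed mask definition, for illustration).

    source_mags: nonnegative array of shape (n_sources, n_freq, n_frames),
    standing in for magnitude spectrograms of the synthesized sources.
    """
    total = source_mags.sum(axis=0, keepdims=True)
    # Each mask is a source's share of the summed magnitude per bin.
    return source_mags / (total + eps)


def apply_masks(mixture_stft, masks):
    """Apply each mask to the complex mixture spectrogram via broadcasting."""
    return masks * mixture_stft[np.newaxis, ...]


# Random stand-in data (hypothetical): 4 sources, 5 frequency bins, 3 frames.
rng = np.random.default_rng(0)
synth_mags = rng.random((4, 5, 3))
mixture = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))

masks = soft_masks(synth_mags)
estimates = apply_masks(mixture, masks)
```

Under this assumed definition the masks sum to approximately one in every time-frequency bin, so the masked estimates add back up to the observed mixture. The separated signals therefore take their phase and fine detail from the mixture itself rather than from the synthesized sources.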