In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture and separating the sources), we also introduce and experiment on the partial inference task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation setting, and showcase competitive quantitative results in the separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
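
To make the three inference tasks concrete, the setup can be sketched as follows. The notation is an assumption introduced for illustration and does not come from the text above: $x_1, \dots, x_N$ denote the individual sources, $y$ their mixture (assumed here to be a plain sum), $I$ the index set of the sources that are given in the imputation task, and $\bar{I}$ its complement.

```latex
% Illustrative notation only (assumed, not taken from the abstract).
\begin{align*}
  y &= \sum_{n=1}^{N} x_n
    && \text{(mixture, assumed to be the sum of the sources)} \\
  s(x_1,\dots,x_N) &= \nabla_{(x_1,\dots,x_N)} \log p(x_1,\dots,x_N)
    && \text{(score of the joint density, the quantity being learned)} \\
  (x_1,\dots,x_N) &\sim p(x_1,\dots,x_N)
    && \text{(generation: sample all sources, then sum them to obtain } y\text{)} \\
  (x_1,\dots,x_N) &\sim p(x_1,\dots,x_N \mid y)
    && \text{(separation: sample sources consistent with a given } y\text{)} \\
  x_{\bar{I}} &\sim p(x_{\bar{I}} \mid x_I)
    && \text{(imputation: sample the missing sources given the others)}
\end{align*}
```

Under this reading, all three tasks draw on the same learned joint score and differ only in what is conditioned on. Diffusion models typically learn the score at several noise levels; that detail is omitted from the sketch.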