In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture, separating the sources), we also introduce and experiment on the partial generation task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task based on Dirac likelihood functions. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the source separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
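The abstract names two technical ingredients without writing them down: a score of the joint density over sources, and a Dirac likelihood for separation. The block below is a minimal sketch of one plausible reading of those two ideas, not the paper's own formulation. All notation is an assumption introduced for illustration: $N$ sources $x_1, \dots, x_N$, a mixture $y$ taken to be their sum, a diffusion time $t$, and a learned score network $s_\theta$.

```latex
% Assumed mixing model: the mixture is the sum of the sources.
\begin{equation}
  y = \sum_{n=1}^{N} x_n
\end{equation}

% Assumed training target: one network approximates the score of the
% joint (noised) density over all sources at diffusion time t.
\begin{equation}
  s_\theta(x_1, \dots, x_N, t) \approx
  \nabla_{(x_1, \dots, x_N)} \log p_t(x_1, \dots, x_N)
\end{equation}

% A Dirac likelihood puts all probability mass on source tuples that
% sum exactly to the observed mixture.
\begin{equation}
  p(y \mid x_1, \dots, x_N) = \delta\!\Big(y - \sum_{n=1}^{N} x_n\Big)
\end{equation}

% Consequence: only N-1 sources are free; the last one is the residual.
\begin{equation}
  x_N = y - \sum_{n=1}^{N-1} x_n
\end{equation}
```

Under this reading, separation samples $N-1$ sources and recovers the last as a residual, so the estimates always sum to the mixture. Source imputation would instead hold the given sources fixed and sample only the missing ones from the same joint model.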