In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture and separating its sources), we also introduce and experiment with the partial inference task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
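To make the three tasks concrete, the following is a notational sketch rather than the paper's own formulation. The symbols $x_n$, $y$, $N$, $\sigma$, $S_\theta$, and the index set $I$ are assumptions introduced here for illustration, as is the convention that the mixture is the sum of its sources.

```latex
% Assumed notation: N sources sharing a context, and their mixture.
\[
  x = (x_1, \dots, x_N), \qquad y = \sum_{n=1}^{N} x_n
\]
% One network approximates the score of the joint density of the
% noise-perturbed sources at noise level \sigma.
\[
  S_\theta(x, \sigma) \approx \nabla_{x} \log p_\sigma(x_1, \dots, x_N)
\]
% The three inference tasks, viewed as sampling from different distributions:
\begin{align*}
  \text{generation:} \quad & x \sim p(x_1, \dots, x_N), \quad \text{then output } y = \textstyle\sum_{n} x_n \\
  \text{source imputation:} \quad & x_{\bar{I}} \sim p(x_{\bar{I}} \mid x_{I}), \quad I \subset \{1, \dots, N\} \text{ the given sources} \\
  \text{separation:} \quad & x \sim p(x_1, \dots, x_N \mid y)
\end{align*}
```

Under this reading, generation and separation are the total inference tasks, while imputation conditions on only part of the source set; all three reuse the same learned joint score.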