In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture, separating the sources), we also introduce and experiment on the partial generation task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task based on Dirac likelihood functions. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the source separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
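The three tasks named above can be stated compactly. The following is only a sketch: the notation ($x_1, \dots, x_N$ for the sources, $y$ for the mixture, $I$ for the index set of the given sources) is assumed here for illustration and is not taken from the abstract, and the mixture is assumed to be the plain sum of the sources.

```latex
% Assumed notation (not from the abstract): N sources x_1..x_N sharing a
% context, mixture y taken to be their sum, I = indices of the given sources.
\begin{align*}
  \text{Model:}      \quad & \nabla_{x} \log p(x_1, \dots, x_N)
      && \text{(score of the joint density)} \\
  \text{Generation:} \quad & x \sim p(x_1, \dots, x_N), \qquad y = \sum_{n=1}^{N} x_n \\
  \text{Imputation:} \quad & x_{\bar{I}} \sim p\left(x_{\bar{I}} \mid x_{I}\right),
      \qquad \bar{I} = \{1, \dots, N\} \setminus I \\
  \text{Separation:} \quad & x \sim p(x \mid y) \propto p(y \mid x)\, p(x),
      \qquad p(y \mid x) = \delta\Big(y - \sum_{n=1}^{N} x_n\Big)
\end{align*}
```

Under this reading, a Dirac likelihood means that separated sources are constrained to sum exactly to the observed mixture, rather than only approximately as under a smooth likelihood.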