Diffusion models have recently shown strong potential in both music generation and music source separation. Although still at an early stage, a trend is emerging towards integrating these tasks into a single framework, as both involve generating musically aligned parts and can be seen as facets of the same generative process. In this work, we introduce a latent diffusion-based multi-track generation model capable of both source separation and multi-track music synthesis, achieved by learning the joint probability distribution of tracks sharing a musical context. Our model also enables arrangement generation, creating any subset of tracks given the others. We trained our model on the Slakh2100 dataset, compared it with an existing simultaneous generation and separation model, and observed significant improvements across objective metrics for the source separation, music generation, and arrangement generation tasks. Sound examples are available at https://msg-ld.github.io/.