While most music generation models generate a mixture of stems (in mono or stereo), we propose training a multi-stem generative model with three stems (bass, drums, and other) that learns the musical dependencies between them. To do so, we train one specialized compression algorithm per stem to tokenize the music into parallel streams of tokens. We then leverage recent improvements in music source separation to train a multi-stream text-to-music language model on a large dataset. Finally, thanks to a particular conditioning method, our model can edit the bass, drums, or other stems of existing or generated songs, as well as perform iterative composition (e.g., generating bass on top of existing drums). This brings greater flexibility to music generation algorithms and is, to the best of our knowledge, the first open-source multi-stem autoregressive music generation model capable of high-quality generation and coherent source editing. Code and model weights will be released, and samples are available at https://simonrouard.github.io/musicgenstem/.
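The per-stem tokenization described above can be sketched as follows. This is a minimal illustration only, assuming a toy uniform quantizer in place of the learned compression models the paper trains; `quantize_stem` and `build_multistream` are hypothetical names, not the authors' API.

```python
import numpy as np

def quantize_stem(audio, codebook_size=1024):
    """Toy stand-in for a learned per-stem compression model:
    map audio samples in [-1, 1] to integer tokens."""
    scaled = (audio + 1.0) / 2.0 * (codebook_size - 1)
    return np.clip(np.round(scaled), 0, codebook_size - 1).astype(np.int64)

def build_multistream(stems):
    """Stack one token stream per stem into a (num_stems, T) array,
    the parallel layout a multi-stream autoregressive LM could consume."""
    return np.stack([quantize_stem(a) for a in stems], axis=0)

rng = np.random.default_rng(0)
bass, drums, other = (rng.uniform(-1, 1, 16) for _ in range(3))
tokens = build_multistream([bass, drums, other])
print(tokens.shape)  # one parallel token stream per stem
```

In practice each stem would be tokenized by its own trained neural codec rather than a fixed quantizer, but the resulting parallel-stream layout is the same.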