Symbolic music generation aims to create musical notes and can help users compose music, for example by generating target instrumental tracks from scratch or from user-provided source tracks. Because source and target tracks can be combined in many flexible ways, a unified model capable of generating arbitrary tracks is needed. Previous works fall short of this because of inherent constraints in their music representations and model architectures. We therefore propose GETMusic ('GET' stands for GEnerate music Tracks), a unified representation and diffusion framework that comprises a novel music representation, GETScore, and a diffusion model, GETDiff. GETScore represents notes as tokens and organizes them in a 2D structure, with tracks stacked vertically and time progressing horizontally. During training, tracks are randomly selected as either targets or sources. In the forward process, target tracks are corrupted by masking their tokens, while source tracks remain as ground truth. In the denoising process, GETDiff learns to predict the masked target tokens, conditioned on the source tracks. With separate tracks in GETScore and the non-autoregressive behavior of the model, GETMusic can explicitly control the generation of any target tracks, either from scratch or conditioned on source tracks. We conduct experiments on music generation involving six instrumental tracks, resulting in a total of 665 combinations. GETMusic provides high-quality results across diverse combinations and surpasses prior works proposed for some specific combinations.
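As an illustration of the track-role idea and the masking-based forward process described above, the following is a minimal sketch, not the GETMusic implementation. It assumes a toy track-by-time grid of string tokens, a hypothetical `[MASK]` token, and a third "unused" role alongside source and target. Under that assumed counting, six tracks with at least one target give 3^6 - 2^6 = 665 role assignments, which matches the number of combinations reported; whether the authors count them this way is not stated here. The names `role_assignments` and `corrupt` are illustrative only.

```python
import random
from itertools import product

MASK = "[MASK]"                            # hypothetical mask token
ROLES = ("source", "target", "unused")     # "unused" is an assumption of this sketch


def role_assignments(n_tracks):
    """Enumerate every labeling of the tracks that has at least one target."""
    return [r for r in product(ROLES, repeat=n_tracks) if "target" in r]


def corrupt(score, roles, mask_prob, rng):
    """Toy forward process on a track-by-time grid.

    Tokens in target tracks are replaced by MASK with probability mask_prob.
    Source tracks are kept as ground truth. Unused tracks are copied unchanged
    here, which is a simplification of this sketch.
    """
    noisy = []
    for row, role in zip(score, roles):
        if role == "target":
            noisy.append([MASK if rng.random() < mask_prob else tok for tok in row])
        else:
            noisy.append(list(row))
    return noisy


if __name__ == "__main__":
    n_tracks, n_steps = 6, 8
    # Rows are tracks (stacked vertically), columns are time steps.
    score = [[f"trk{t}_step{s}" for s in range(n_steps)] for t in range(n_tracks)]

    combos = role_assignments(n_tracks)
    print(len(combos))  # prints 665

    rng = random.Random(0)
    roles = rng.choice(combos)
    noisy = corrupt(score, roles, mask_prob=0.5, rng=rng)
    for role, row in zip(roles, noisy):
        print(f"{role:>6}: {row}")
```

A denoising model in this setup would receive the noisy grid and be trained to recover the original tokens at the masked positions of the target rows, using the untouched source rows as conditioning.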