Automatically generating symbolic music, i.e., music scores tailored to specific human needs, can be highly beneficial for musicians and enthusiasts. Recent studies have shown promising results using extensive datasets and advanced transformer architectures. However, these state-of-the-art models generally offer only coarse control over attributes such as tempo and style for the entire composition, and lack the ability to manage finer details, such as control at the level of individual bars. While fine-tuning a pre-trained symbolic music generation model might seem a straightforward route to this finer control, our research indicates that the approach is problematic: the model often fails to respond adequately to new, fine-grained bar-level control signals. To address this, we propose two solutions. First, we introduce a pre-training task designed to link control signals directly with their corresponding musical tokens, yielding a more effective initialization for subsequent fine-tuning. Second, we implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts. Together, these techniques significantly enhance bar-level control over music generation, showing a 13.06\% improvement over conventional methods. Our subjective evaluations also confirm that this enhanced control does not compromise the musical quality of the original pre-trained generative model.
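The abstract does not spell out the form of the counterfactual loss. One plausible reading, shown here purely as a hypothetical sketch (the function names, margin formulation, and dummy logits are illustrative assumptions, not the paper's actual objective), is a hinge-style term that asks the generated tokens to be more likely under the true bar-level control signal than under a mismatched, counterfactual one:

```python
import numpy as np

def nll(logits, targets):
    """Mean negative log-likelihood of target tokens under softmax(logits)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def counterfactual_loss(logits_true, logits_cf, targets, margin=1.0):
    """Hypothetical counterfactual term: the music tokens should be at least
    `margin` nats more likely under the true control than under a swapped one.

    logits_true -- model logits conditioned on the correct bar-level control
    logits_cf   -- logits conditioned on a perturbed (counterfactual) control
    """
    nll_true = nll(logits_true, targets)
    nll_cf = nll(logits_cf, targets)
    # Standard NLL plus a hinge that activates when the counterfactual
    # control explains the tokens almost as well as the true control.
    return nll_true + max(0.0, margin - (nll_cf - nll_true))
```

When the counterfactual control already fits the tokens much worse than the true one, the hinge is zero and the loss reduces to ordinary NLL; otherwise the extra term pushes the model to make its output depend on the control signal, which is one generic way to encourage the control-output alignment the abstract describes.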