Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods are only able to exert minimal control over the generated sequence, if any. We propose the self-supervised description-to-sequence task, which allows for fine-grained controllable generation on a global level. We do so by extracting high-level features about the target sequence and learning the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying description-to-sequence modelling to symbolic music. By combining learned high-level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.
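To make the self-supervised setup concrete, the following is a minimal sketch of how (description, sequence) training pairs could be derived from the target sequence alone, with no human annotation. Everything in it is an assumption made for illustration: the toy note representation, the token names, and the three per-bar features (note density, mean pitch, mean duration) are hypothetical and are not claimed to match the features or tokenization used by FIGARO.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy representation: (bar_index, midi_pitch, duration_in_beats).
Note = tuple[int, int, float]


def describe_bar(notes: list[Note]) -> list[str]:
    """Summarize one bar as coarse description tokens (illustrative features only)."""
    density = len(notes)
    avg_pitch = mean(pitch for _, pitch, _ in notes)
    avg_duration = mean(dur for _, _, dur in notes)
    return [
        f"Density_{density}",
        f"MeanPitch_{round(avg_pitch)}",
        f"MeanDuration_{avg_duration:.2f}",
    ]


def make_training_pair(notes: list[Note]) -> tuple[list[str], list[str]]:
    """Build a (description, sequence) pair from the target sequence itself."""
    bars: dict[int, list[Note]] = defaultdict(list)
    for note in notes:
        bars[note[0]].append(note)

    description: list[str] = []
    sequence: list[str] = []
    for bar_index in sorted(bars):
        # Source side: high-level summary of the bar.
        description.append(f"Bar_{bar_index}")
        description.extend(describe_bar(bars[bar_index]))
        # Target side: the full note-level content of the same bar.
        sequence.append(f"Bar_{bar_index}")
        for _, pitch, dur in bars[bar_index]:
            sequence.extend([f"Pitch_{pitch}", f"Duration_{dur:.2f}"])
    return description, sequence


notes = [(0, 60, 1.0), (0, 64, 1.0), (1, 67, 2.0)]
description, sequence = make_training_pair(notes)
```

A sequence-to-sequence model would then be trained to predict `sequence` given `description`. Because the description is computed from the target, any corpus of symbolic music yields supervision for free, and at inference time a user can edit the description tokens to steer generation.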