Controllable music generation facilitates the interaction between humans and composition systems by allowing users to project their intent onto the generated music. Introducing controllability is an increasingly important challenge in the field of symbolic music generation. Building controllable generative systems for popular multi-instrument music typically faces two main challenges: weak controllability and poor music quality. To address these issues, we first propose spatiotemporal features as powerful and fine-grained controls to enhance the controllability of the generative model. In addition, we design an efficient music representation called REMI_Track, which converts multitrack music into multiple parallel per-track sequences and shortens each track's sequence with Byte Pair Encoding (BPE). We then propose BandControlNet, a conditional model based on parallel Transformers, which handles these multiple music sequences and generates high-quality music samples conditioned on the given spatiotemporal control features. More concretely, two specially designed modules of BandControlNet, structure-enhanced self-attention (SE-SA) and the Cross-Track Transformer (CTT), strengthen the modeling of musical structure and inter-track harmony, respectively. Experiments on two popular-music datasets of different lengths demonstrate that BandControlNet outperforms other conditional music generation models on most objective metrics in terms of both fidelity and inference speed, and shows strong robustness when generating long music samples. Subjective evaluations show that BandControlNet trained on the short dataset generates music of quality comparable to state-of-the-art models, while significantly outperforming them on the longer dataset.
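The BPE compression mentioned above can be illustrated with a minimal sketch: the most frequent adjacent token pair in a track's event sequence is repeatedly merged into a single token, shortening the sequence. The token names below (e.g. `Pitch_60`, `Dur_4`) are hypothetical placeholders; the actual REMI_Track vocabulary and merge rules are defined in the paper, not here.

```python
from collections import Counter

def bpe_compress(tokens, num_merges):
    """Shorten a token sequence by iteratively merging the most
    frequent adjacent pair (a minimal BPE sketch, not the paper's
    exact implementation)."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats; further merging gains nothing
            break
        merged = f"{a}+{b}"
        merges.append((a, b))
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(merged)  # replace the pair with one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges

# A recurring pitch/duration pair is merged, shortening the track:
track = ["Pitch_60", "Dur_4", "Pitch_60", "Dur_4", "Pitch_62", "Dur_4"]
compressed, merges = bpe_compress(track, num_merges=1)
```

Applied per track, such merges reduce the length of each parallel sequence that the model must process, which is what makes the representation efficient for long multitrack pieces.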