Multi-instrument music generation has recently become an active research topic. Unlike single-instrument generation, it must account for inter-track harmony in addition to intra-track coherence. This is usually achieved by composing note segments from different instruments into a signal sequence. The composition can be done at different scales, such as note, bar, or track. Most existing work focuses on a single scale, which limits its ability to model music with diverse temporal and track dependencies. This paper proposes a multi-scale attentive Transformer model to improve the quality of multi-instrument generation. We first employ multiple Transformer decoders to learn multi-instrument representations at different scales, and then design an attentive mechanism to fuse the multi-scale information. Experiments on the SOD and LMD datasets show that our model improves both quantitative and qualitative performance over models based on single-scale information. The source code and some generated samples can be found at https://github.com/HaRry-qaq/MSAT.
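To make the fusion step concrete, the following is a minimal sketch of one way an attentive mechanism could combine per-scale representations. It is not taken from the paper or its repository. It assumes that each scale-specific decoder (note, bar, track) has already produced a hidden vector for the same position, and that fusion is a softmax-weighted sum scored by a scaled dot product against a query vector. The names `fuse_scales`, `scale_states`, and `query` are hypothetical, and the toy vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(
    scale_states: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse one hidden vector per scale into a single vector.

    Assumption (not from the paper): each scale is scored by a scaled
    dot product with `query`, and the fused vector is the
    softmax-weighted sum of the per-scale vectors.
    """
    dim = len(query)
    if any(len(state) != dim for state in scale_states):
        raise ValueError("every scale state must match the query dimension")

    # One relevance score per scale.
    scores = [
        sum(h * q for h, q in zip(state, query)) / math.sqrt(dim)
        for state in scale_states
    ]
    weights = softmax(scores)

    # Weighted sum across scales, computed per feature dimension.
    fused = [
        sum(w * state[d] for w, state in zip(weights, scale_states))
        for d in range(dim)
    ]
    return fused, weights


# Toy hidden vectors standing in for the outputs of three decoders.
note_state = [1.0, 0.0, 0.0, 0.0]
bar_state = [0.0, 1.0, 0.0, 0.0]
track_state = [0.0, 0.0, 1.0, 0.0]
query = [2.0, 0.0, 0.0, 0.0]

fused, weights = fuse_scales([note_state, bar_state, track_state], query)
```

In this toy case the query aligns with the note-scale vector, so that scale receives the largest weight while the bar and track scales share the remainder equally. In a trained model the query and the per-scale vectors would be learned, and the fusion would be applied at every position of the sequence.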