Conditional music generation offers significant advantages in user convenience and control, and holds great potential for AI-generated content research. However, building conditional generative systems for multitrack popular songs presents three primary challenges: insufficient fidelity of input conditions, poor structural modeling, and inadequate learning of inter-track harmony in generative models. To address these issues, we propose BandCondiNet, a conditional model based on parallel Transformers, designed to process multiple music sequences and generate high-quality multitrack samples. Specifically, we propose multi-view features across time and instruments as high-fidelity conditions. Moreover, we propose two specialized modules for BandCondiNet: Structure Enhanced Attention (SEA) to strengthen musical structure, and Cross-Track Transformer (CTT) to enhance inter-track harmony. We conducted both objective and subjective evaluations on two popular music datasets with different sequence lengths. Objective results on the shorter dataset show that BandCondiNet outperforms other conditional models on 9 of 10 metrics related to fidelity and inference speed, the exception being Chord Accuracy. On the longer dataset, BandCondiNet surpasses all conditional models on all 10 metrics. Subjective evaluations across four criteria show that BandCondiNet, when trained on the shorter dataset, performs best in Richness and comparably to state-of-the-art models on the other three criteria; when trained on the longer dataset, it significantly outperforms them on all four criteria. To further expand the application scope of BandCondiNet, future work should focus on developing an advanced conditional model capable of adapting to more user-friendly input conditions and supporting flexible instrumentation.