At present, neural network models show powerful sequence prediction ability and are widely used in automatic composition models. By contrast, the way humans compose music is very different. Composers usually start by creating musical motifs and then develop them into complete pieces through a series of rules, a process that gives the music a specific structure and pattern of variation. However, it is difficult for neural network models to learn these composition rules from training data, which results in a lack of musicality and diversity in the generated music. This paper posits that integrating the learning capabilities of neural networks with human-derived compositional knowledge may lead to better results. To achieve this, we develop the POP909$\_$M dataset, the first to include labels for musical motifs and their variants, providing a basis for mimicking human compositional habits. Building on this, we propose MeloTrans, a text-to-music composition model that employs principles of motif development. Our experiments demonstrate that MeloTrans outperforms existing music generation models and even surpasses Large Language Models (LLMs) such as ChatGPT-4. This highlights the importance of merging human insights with neural network capabilities to achieve superior symbolic music generation.