Recent transformer-based music generation methods have a context window of up to about one minute. The music generated by these methods is largely unstructured beyond the context window, and even with a longer window, learning long-scale structure from musical data remains a prohibitively challenging problem. This paper proposes integrating a text-to-music model with a large language model to generate music with form. We discuss our solutions to the challenges of such integration. The experimental results show that the proposed method can generate 2.5-minute-long music that is highly structured, strongly organized, and cohesive.