Developing text-driven symbolic music generation models remains challenging because aligned text-music datasets are scarce and automated captioning pipelines are unreliable. While most efforts have focused on MIDI, sheet music representations remain largely underexplored in text-driven generation. We present Text2Score, a two-stage framework for generating sheet music from natural language prompts, comprising a planning stage and an execution stage. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan that defines musical attributes such as instruments, key, time signature, and harmony. In the execution stage, a generative model consumes this plan and produces interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set, and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).
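To make the plan-then-execute split concrete, the following is a minimal Python sketch of what a measure-wise plan and its hand-off to the execution stage might look like. The abstract does not specify the plan schema or the conditioning format, so the `MeasurePlan` class, its field names, and the functions `plan_to_conditioning` and `abc_header` are all hypothetical and serve only as an illustration. The header fields (`X:`, `T:`, `M:`, `L:`, `V:`, `K:`) are standard ABC notation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MeasurePlan:
    """Hypothetical per-measure plan entry; field names are illustrative only."""
    index: int               # 1-based measure number
    instruments: List[str]   # instruments active in this measure
    key: str                 # key in ABC syntax, e.g. "G" or "Am"
    time_signature: str      # meter, e.g. "3/4"
    harmony: str             # chord symbol for the measure, e.g. "D7"


def plan_to_conditioning(plan: List[MeasurePlan]) -> str:
    """Serialize a plan into one text line per measure.

    The line format is an assumption made for this sketch; a real
    execution model would define its own conditioning format.
    """
    lines = []
    for m in plan:
        lines.append(
            f"measure={m.index} | inst={','.join(m.instruments)} | "
            f"key={m.key} | meter={m.time_signature} | chord={m.harmony}"
        )
    return "\n".join(lines)


def abc_header(plan: List[MeasurePlan], title: str = "Untitled") -> str:
    """Build an ABC tune header from the plan's first measure and all voices."""
    first = plan[0]
    # One voice per distinct instrument, sorted for a deterministic order.
    voices = sorted({inst for m in plan for inst in m.instruments})
    header = ["X:1", f"T:{title}", f"M:{first.time_signature}", "L:1/8"]
    header += [f'V:{i + 1} name="{name}"' for i, name in enumerate(voices)]
    header.append(f"K:{first.key}")  # K: is the last field of an ABC header
    return "\n".join(header)


if __name__ == "__main__":
    plan = [
        MeasurePlan(1, ["Violin", "Piano"], "G", "3/4", "G"),
        MeasurePlan(2, ["Piano"], "G", "3/4", "D7"),
    ]
    print(plan_to_conditioning(plan))
    print(abc_header(plan, title="Demo"))
```

In this sketch the planner's output is plain structured data, so it can be checked for consistency (for example, that every measure names at least one instrument) before any notes are generated. This is one plausible reason to separate planning from execution, not a description of the authors' implementation.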