Recent advances in music generation produce impressive samples; however, practical creation still lacks two key capabilities: composer-style structural editing and minute-scale coherence. We present MusicWeaver, a framework for generating and editing long-range music using a human-interpretable intermediate representation with guaranteed edit locality. MusicWeaver decomposes generation into two stages: it first predicts a structured plan, a multi-level song program encoding musical attributes that composers can directly edit, and then renders audio conditioned on this plan. To ensure minute-scale coherence, we introduce a Global-Local Diffusion Transformer, in which a global path captures long-range musical progression via compressed representations and memory, while a local path synthesizes fine-grained acoustic detail. We further propose a Motif Memory Retrieval module that enables consistent motif recurrence with controllable variation. For editing, we propose Projected Diffusion Inpainting, which denoises only user-specified regions and preserves unchanged content, allowing repeated edits without drift. Finally, we introduce two metrics, the Structure Coherence Score and the Edit Fidelity Score, to evaluate long-range form and edit realization. Experiments demonstrate that MusicWeaver achieves state-of-the-art fidelity, controllability, and long-range coherence.
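The abstract does not give implementation details for Projected Diffusion Inpainting, but the stated behavior (denoise only user-specified regions while pinning unchanged content) matches mask-based diffusion inpainting, where at every reverse step the unedited region is re-projected onto the forward-diffused known signal. The sketch below is a toy 1D illustration under that assumption; the diffusion schedule, the `toy_denoiser` stand-in, and all function names are hypothetical, not the paper's actual model.

```python
import math
import random

random.seed(0)

# toy linear noise schedule (assumed DDPM-style; not from the paper)
T = 50
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

def q_sample(x0, t):
    """Forward-diffuse a clean signal x0 to noise level t."""
    a = math.sqrt(alpha_bar[t])
    s = math.sqrt(1.0 - alpha_bar[t])
    return [a * v + s * random.gauss(0, 1) for v in x0]

def toy_denoiser(x_t, t):
    """Stand-in for the learned model: shrink noisy values toward
    the clean scale. A real system would call its trained network."""
    return [v * math.sqrt(alpha_bar[t]) for v in x_t]

def projected_inpaint(x_known, edit_mask):
    """Denoise only where edit_mask == 1; at every step, re-project
    the unedited region onto the forward-diffused known content."""
    x = [random.gauss(0, 1) for _ in x_known]
    for t in reversed(range(T)):
        x0_hat = toy_denoiser(x, t)
        if t > 0:
            x = q_sample(x0_hat, t - 1)
            known_t = q_sample(x_known, t - 1)
            x = [m * a + (1 - m) * b
                 for m, a, b in zip(edit_mask, x, known_t)]
        else:
            # final step: unedited region is restored exactly
            x = [m * a + (1 - m) * b
                 for m, a, b in zip(edit_mask, x0_hat, x_known)]
    return x

signal = [math.sin(4 * math.pi * i / 63) for i in range(64)]
mask = [1.0 if 20 <= i < 40 else 0.0 for i in range(64)]
result = projected_inpaint(signal, mask)
# outside the user-specified edit region, content is preserved exactly,
# which is what allows repeated edits without drift
```

Because the unedited region is overwritten with (a noised version of) the known signal at every step, repeated applications never accumulate error outside the edit mask, which is the edit-locality property the abstract claims.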