Music editing primarily involves modifying individual instrument tracks or remixing a piece as a whole, offering a novel reinterpretation of the original through a series of operations. These music processing methods hold great potential across many applications but demand substantial expertise. Prior methods, although effective for image and audio editing, falter when applied directly to music: because of music's distinctive data characteristics, they can inadvertently compromise its intrinsic harmony and coherence. In this paper, we develop InstructME, an Instruction-guided Music Editing and remixing framework based on latent diffusion models. Our framework strengthens the U-Net with multi-scale aggregation to maintain consistency before and after editing. In addition, we introduce a chord progression matrix as conditioning information and incorporate it into the semantic space to improve melodic harmony during editing. To accommodate extended musical pieces, InstructME employs a chunk transformer, enabling it to capture long-term temporal dependencies within music sequences. We evaluate InstructME on instrument editing, remixing, and multi-round editing. Both subjective and objective evaluations indicate that the proposed method significantly surpasses preceding systems in music quality, text relevance, and harmony. Demo samples are available at https://musicedit.github.io/
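To make the notion of a chord progression matrix concrete, the following is a minimal sketch of one plausible encoding: a one-hot matrix with one row per time frame and one column per chord class. The abstract does not specify the actual layout, vocabulary, or frame rate used by InstructME, so the vocabulary `CHORD_VOCAB`, the helper `chord_progression_matrix`, and the fallback to a "no chord" class are all illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical chord vocabulary; the abstract does not state the one used by InstructME.
CHORD_VOCAB = ["N", "C:maj", "F:maj", "G:maj", "A:min"]


def chord_progression_matrix(frame_chords, vocab=CHORD_VOCAB):
    """Encode per-frame chord labels as a (num_frames x vocab_size) one-hot matrix.

    The matrix is returned as a list of rows so the sketch has no dependencies.
    """
    index = {name: i for i, name in enumerate(vocab)}
    matrix = []
    for chord in frame_chords:
        row = [0] * len(vocab)
        # Assumption: labels outside the vocabulary map to "N" (no chord).
        row[index.get(chord, index["N"])] = 1
        matrix.append(row)
    return matrix


# Example: a five-frame progression, one chord label per frame.
progression = ["C:maj", "C:maj", "A:min", "F:maj", "G:maj"]
matrix = chord_progression_matrix(progression)
```

A matrix of this shape could then be embedded and supplied to a diffusion model as a conditioning signal alongside the text instruction; how InstructME actually injects it into the semantic space is described only at a high level in the abstract.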