Music editing primarily entails modifying instrument tracks or remixing a piece as a whole, offering a novel reinterpretation of the original through a series of operations. These music processing methods hold immense potential across various applications but demand substantial expertise. Prior methods, although effective for image and audio modification, falter when applied directly to music. This is attributed to the distinctive nature of music data: such methods can inadvertently compromise the intrinsic harmony and coherence of the music. In this paper, we develop InstructME, an Instruction-guided Music Editing and remixing framework based on latent diffusion models. Our framework fortifies the U-Net with multi-scale aggregation to maintain consistency before and after editing. In addition, we introduce a chord progression matrix as conditioning information and incorporate it into the semantic space to improve melodic harmony during editing. To accommodate extended musical pieces, InstructME employs a chunk transformer, enabling it to discern long-term temporal dependencies within music sequences. We tested InstructME on instrument editing, remixing, and multi-round editing. Both subjective and objective evaluations indicate that our proposed method significantly surpasses preceding systems in music quality, text relevance, and harmony. Demo samples are available at https://musicedit.github.io/