Recent advances in text-to-music editing, which employ text queries to modify music (e.g.\ by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To combine the strengths of both approaches and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach modifies the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen introduces only 8% new parameters relative to the original MusicGen model and trains for only 5K steps, yet it achieves superior performance across all tasks compared to existing baselines, and demonstrates performance comparable to models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.
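To make the architectural idea concrete, the following is a minimal, hypothetical sketch of the dual-fusion design described above: a decoder layer is augmented with a text fusion module (cross-attention over instruction-text embeddings) and an audio fusion module (cross-attention over embeddings of the input audio). The module names, dimensions, and wiring here are illustrative assumptions, not the actual Instruct-MusicGen implementation.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Residual cross-attention from the music stream into a conditioning stream.

    Hypothetical module: queries come from the decoder's hidden states,
    keys/values from the conditioning (text or audio) embeddings.
    """
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query=x, key=cond, value=cond)
        return self.norm(x + out)

class InstructDecoderLayer(nn.Module):
    """One decoder layer with added text- and audio-fusion modules.

    Only the two fusion blocks would be new (trainable) parameters;
    the base self-attention stands in for the pretrained MusicGen layer.
    """
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.text_fusion = FusionBlock(d_model)   # conditions on the instruction
        self.audio_fusion = FusionBlock(d_model)  # conditions on the input music
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, text_emb, audio_emb):
        h, _ = self.self_attn(x, x, x)
        h = self.norm(x + h)
        h = self.text_fusion(h, text_emb)
        h = self.audio_fusion(h, audio_emb)
        return h

layer = InstructDecoderLayer()
x = torch.randn(2, 16, 64)      # (batch, music tokens, dim)
text = torch.randn(2, 8, 64)    # instruction-text embeddings
audio = torch.randn(2, 16, 64)  # input-audio embeddings
y = layer(x, text, audio)
print(y.shape)                  # torch.Size([2, 16, 64])
```

The key design point this sketch illustrates is parameter efficiency: because conditioning is injected through small side modules rather than by retraining the backbone, only the fusion blocks need to be trained, consistent with the abstract's claim of adding few new parameters.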