The emergence of novel generative modeling paradigms, particularly audio language models, has significantly advanced the field of song generation. Although state-of-the-art models can synthesize vocals and accompaniment tracks concurrently for songs up to several minutes long, partial adjustment or editing of existing songs, which would allow for more flexible and effective production, remains underexplored. In this paper, we present SongEditor, the first song editing paradigm that introduces editing capabilities into language-modeling song generation approaches, facilitating both segment-wise and track-wise modifications. SongEditor offers the flexibility to adjust lyrics, vocals, and accompaniments, as well as to synthesize songs from scratch. The core components of SongEditor include a music tokenizer, an autoregressive language model, and a diffusion generator, which together enable the generation of an entire section, masked lyrics, or even separated vocals and background music. Extensive experiments demonstrate that the proposed SongEditor achieves exceptional performance in end-to-end song editing, as evidenced by both objective and subjective metrics. Audio samples are available at \url{https://cypress-yang.github.io/SongEditor_demo/}.