Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging latent diffusion models (LDMs) to produce high-quality, diverse, and instruction-relevant audio. However, beyond generation, audio editing remains equally important yet has received comparatively little attention. Audio editing faces two primary challenges: executing precise edits and preserving the unedited regions. While LDM-based workflows have effectively addressed these challenges in image processing, similar approaches have rarely been applied to audio editing. In this paper, we introduce AudioEditor, a training-free audio editing framework built on a pretrained diffusion-based TTA model. AudioEditor incorporates Null-text Inversion and EOT-suppression, enabling the model to preserve the original audio's features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high-quality audio edits. Code and demos are available at https://github.com/NKU-HLT/AudioEditor.