Text-guided audio editing aims to modify the acoustic content specified by a text prompt while preserving the source components that are irrelevant to the edit. Existing training-free methods typically rely on inversion-based editing. Inversion-free editing is appealing because it reduces both computational overhead and reconstruction error, yet it remains largely unexplored for audio. The key challenge is to construct a source-to-target editing path through the diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt at a training-free, inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8%, respectively, compared with DDPM inversion, while achieving up to a 64.5% editing speedup.