We introduce AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator. Text-to-audio systems such as Stable Audio 3 already expose audio-to-audio editing by noising an input recording and denoising it under a new prompt, but this inversion-style route forces a trade-off between prompt adherence and preservation of rhythm, transients, timbre, and long-range musical structure. Motivated by recent inversion-free flow editing in computer vision, we develop an audio-specific ordinary differential equation (ODE) that maps source directly to target in the one-dimensional Stable Audio 3 latent space: at each flow step, we compare the target- and source-conditioned velocity fields under a shared stochastic source marginal and update the edited latent by their difference. The resulting editor requires no training, no paired edit data, no optimization, and no access to internal attention maps. On sound-effect and music editing sets built from FSD50K and the Song Describer Dataset, AudEdit improves both CLAP text alignment and audio preservation over SDEdit, ODE inversion, and FireFlow; for example, on sound effects it raises target-text CLAP similarity from 0.42 for the strongest baseline to 0.52 while reducing FAD from 65.70 to 50.37.
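The per-step update described in the abstract can be illustrated with a minimal sketch. This is not AudEdit's implementation: it follows the generic inversion-free construction from the vision literature, and several details are assumptions made only for illustration. These include the function name `flow_edit`, the hypothetical `velocity(z, t, cond)` callable standing in for the pretrained generator, the time convention (t = 1 is pure noise, t = 0 is clean data, with noisy latents formed as (1 - t) * x + t * noise), the uniform step schedule, and the default step count.

```python
import numpy as np

def flow_edit(x_src, velocity, src_cond, tgt_cond, num_steps=50, seed=0):
    """Inversion-free edit of a latent `x_src` (illustrative sketch only).

    `velocity(z, t, cond)` is a hypothetical stand-in for the generator's
    velocity field. Assumed convention: t=1 is noise, t=0 is data.
    """
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, 0.0, num_steps + 1)  # integrate from noise to data
    z_edit = np.array(x_src, dtype=float)      # the edit path starts at the source
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # Draw a fresh sample from the source marginal at time t_cur.
        noise = rng.standard_normal(x_src.shape)
        z_src = (1.0 - t_cur) * x_src + t_cur * noise
        # Couple the target-side latent to the same noise draw.
        z_tgt = z_edit + z_src - x_src
        # Difference of target- and source-conditioned velocities.
        dv = velocity(z_tgt, t_cur, tgt_cond) - velocity(z_src, t_cur, src_cond)
        # Euler step on the edited latent only (t_next - t_cur is negative).
        z_edit = z_edit + (t_next - t_cur) * dv
    return z_edit
```

Because the source and target branches share the same noise draw at each step, the update vanishes whenever the two prompts agree, so in this sketch an unchanged prompt returns the source latent unchanged. No inversion pass is run at any point.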