Text-to-audio (TTA) generation has made significant strides, yet precise and consistent audio editing remains a major challenge: existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework built on the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic uses an optimized inversion-reverse process and joint text-audio attention maps to extract target segments precisely. For content editing, a novel scheduled attention decoupling mechanism confines modifications to target regions while preserving the original acoustic context. Furthermore, task-oriented noise injection extends the framework to tasks such as audio removal and non-rigid replacement. Extensive experiments demonstrate that FreeSonic achieves a superior balance between temporal consistency and background preservation, providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/