With the growing focus on audio in multimedia applications, numerous works on audio generation have emerged. Existing studies typically treat text-to-audio (TTA) generation and related tasks, such as instruction-based audio editing, as independent problems and adopt task-specific architectures or modules. The absence of a unified modeling paradigm substantially increases the overhead and complexity of building a system that supports both audio generation and editing, and it limits scalability. To address this issue, we introduce AudioWeave, a unified model for TTA and audio editing that requires no additional task-specific components. Specifically, we propose a joint condition modeling approach with a factorized position embedding, which enables the diffusion transformer backbone to operate on the heterogeneous inputs of TTA and audio editing. We further propose a progressive multistage training strategy to mitigate the task competition and catastrophic forgetting caused by interference among multiple tasks. This strategy helps maintain the performance of each individual task and may even improve it in certain aspects. Experimental results on the TTA task and six audio editing tasks show that our unified model achieves performance competitive with task-specific models, laying the groundwork for further exploration of unified audio generation models.