Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically handled by separate specialized models, and a unified framework that integrates all three tasks remains underexplored. Although some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across the general sound, music, and speech domains, with integrated multimodal understanding capabilities. Our architecture combines a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while performing on par with or better than specialized expert models. Beyond these core capabilities, Audio-Omni exhibits remarkable inherited abilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released at https://zeyuet.github.io/Audio-Omni.
