AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

Audio and music generation from flexible multimodal control signals has broad applications but faces three key challenges: 1) building a unified multimodal modeling framework, 2) obtaining large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address these challenges, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates diverse multimodal conditions (text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher, AudioX-Base, is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis. It is then distilled into the few-step student, AudioX-Turbo, via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support training, we construct IF-caps-Pro, a large-scale, high-quality dataset of approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks and find that it achieves superior performance, especially on text-to-audio and text-to-music generation, while using only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method generates audio under flexible multimodal control with efficient and strong instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.
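
To make the efficiency claim concrete, the following minimal sketch shows how the number of function evaluations (NFE) scales with the step count in a fixed-step Euler sampler for a flow-matching ODE. It is an illustration under stated assumptions, not the AudioX-Turbo sampler: `euler_sample`, `toy_velocity`, and `target` are hypothetical names, the velocity field is a closed-form toy standing in for a trained network, and the 100-step baseline is a hypothetical figure chosen only to match the stated ratio of roughly 25x.

```python
import numpy as np


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final sample and the number of function evaluations (NFE),
    which here equals the number of steps (one velocity call per step).
    """
    x = np.asarray(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    nfe = 0
    for i in range(num_steps):
        t = i * dt  # t stays strictly below 1, so the toy field is well defined
        x = x + dt * velocity(x, t)
        nfe += 1
    return x, nfe


# Toy stand-in for a trained velocity network (an assumption for illustration):
# along the straight path x_t = (1 - t) * x_0 + t * target, the velocity that
# carries x_t to the target is (target - x_t) / (1 - t).
target = np.array([1.0, -2.0, 0.5])


def toy_velocity(x, t):
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
noise = rng.standard_normal(3)

x_few, nfe_few = euler_sample(toy_velocity, noise, num_steps=4)      # few-step setting
x_many, nfe_many = euler_sample(toy_velocity, noise, num_steps=100)  # hypothetical baseline

print(f"few-step NFE: {nfe_few}, baseline NFE: {nfe_many}, ratio: {nfe_many / nfe_few:.0f}x")
```

With this toy field both runs land on the same target, so only the cost differs. For a real model, a small step count generally degrades quality unless the model is trained for it, which is the gap the distillation stage is described as closing.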
