Audio and music generation under flexible multimodal control signals is widely applicable, but it poses three key challenges: 1) building a unified multimodal modeling framework, 2) obtaining large-scale, high-quality training data, and 3) reducing the prohibitive inference cost of multi-step diffusion sampling. To address these challenges, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher, AudioX-Base, is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis. It is then distilled into the few-step student, AudioX-Turbo, via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator that preserves quality in few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks and find that it achieves superior performance, especially on text-to-audio and text-to-music generation, while using only 4 sampling steps, with a number of function evaluations (NFE) approximately 25x lower than that of multi-step baselines. These results demonstrate that our method supports audio generation under flexible multimodal control and exhibits efficient, strong instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.
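To make the relationship between sampling steps and NFE concrete, the following is a minimal sketch, not the AudioX-Turbo implementation. It assumes a plain Euler sampler over a flow-matching velocity field, one function evaluation per step, and a toy closed-form velocity in place of a trained network. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the 100-evaluation baseline is simply 4 x 25; the abstract does not state the baselines' actual configuration.

```python
from typing import Callable, List, Tuple

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int) -> Tuple[Vector, int]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with uniform Euler steps.

    Returns the final sample and the number of function evaluations (NFE).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x0)
    dt = 1.0 / num_steps
    nfe = 0
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)  # in a real model, this is one network forward pass
        nfe += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, nfe


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Toy stand-in for a trained model (assumption): straight-line flow to `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        # Velocity of the straight path that reaches `target` at t=1.
        # t never reaches 1 inside euler_sample, so the division is safe.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]
    start = [0.0, 0.0, 0.0]
    velocity = make_toy_velocity(target)

    few_sample, few_nfe = euler_sample(velocity, start, num_steps=4)
    many_sample, many_nfe = euler_sample(velocity, start, num_steps=100)

    print("few-step NFE:", few_nfe)
    print("multi-step NFE:", many_nfe)
    print("NFE ratio:", many_nfe / few_nfe)
```

In this toy setting the path is perfectly straight, so 4 steps and 100 steps reach the same endpoint. A real pretrained velocity field is generally curved, which is why reducing the step count typically requires a distillation stage such as the one described above.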