Audio and music generation guided by flexible multimodal control signals is a widely applicable research problem that faces two key challenges: 1) the lack of a unified multimodal modeling framework, and 2) the scarcity of large-scale, high-quality training data. To address these challenges, we propose AudioX, a unified framework for anything-to-audio generation that integrates diverse multimodal conditions (i.e., text, video, and audio signals). The core of the framework is a Multimodal Adaptive Fusion module, which effectively fuses diverse multimodal inputs, strengthening cross-modal alignment and improving overall generation quality. To train this unified model, we construct IF-caps, a large-scale, high-quality dataset of over 7 million samples curated through a structured data annotation pipeline, providing comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks and find that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate that our method can generate audio under multimodal control signals and exhibits strong instruction-following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.