Audio and music generation from flexible multimodal control signals is a widely applicable topic that faces two key challenges: 1) building a unified multimodal modeling framework, and 2) obtaining large-scale, high-quality training data. In this work, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). The core design of this framework is a Multimodal Adaptive Fusion module, which enables effective fusion of diverse multimodal inputs, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks and find that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate that our method is capable of audio generation under multimodal control signals and shows strong instruction-following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.