Large-scale multimodal generative modeling has reached milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio, a prompt-enhanced diffusion model that addresses these gaps in two ways: 1) it introduces pseudo prompt enhancement with a distill-then-reprogram approach, which alleviates data scarcity by using language-free audio to construct orders of magnitude more concept compositions; 2) it leverages a spectrogram autoencoder to predict self-supervised audio representations instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluations. Moreover, we demonstrate its controllability and generalization for X-to-Audio under the principle of "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audio given a user-defined modality input. Audio samples are available at https://Text-to-Audio.github.io
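To make the "concept compositions" claim concrete, the following is a minimal sketch of how recombining pseudo-captioned, language-free clips could multiply the number of text-audio training pairs. It is an illustration under stated assumptions, not the authors' implementation: the clip database, the pseudo captions, the templates, and the `compose_pairs` helper are all hypothetical, and the abstract does not specify how captions are produced or how clips are merged.

```python
import itertools

# Hypothetical database of language-free clips. Each pseudo caption is
# assumed to come from some expert captioning model (the "distill" step).
CLIPS = {
    "dog_01": "a dog barking",
    "rain_07": "rain falling on a roof",
    "bell_03": "a church bell ringing",
    "car_12": "a car engine starting",
}

# Hypothetical text templates used to join two captions (the "reprogram" step).
TEMPLATES = ["{a} and {b}", "{a} followed by {b}"]


def compose_pairs(clips, templates):
    """Build (caption, (clip_id_a, clip_id_b)) for every ordered clip pair and template."""
    composed = []
    for (id_a, cap_a), (id_b, cap_b) in itertools.permutations(clips.items(), 2):
        for template in templates:
            caption = template.format(a=cap_a, b=cap_b)
            composed.append((caption, (id_a, id_b)))
    return composed


pairs = compose_pairs(CLIPS, TEMPLATES)
print(len(CLIPS), "clips yield", len(pairs), "composed text-audio pairs")
```

With only four clips and two templates, the sketch already yields 24 distinct composed captions. The count grows as n(n-1) times the number of templates, which is the kind of combinatorial growth the abstract alludes to. How the paired audio clips would actually be mixed or concatenated is left out here because the abstract does not describe it.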