Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, it alleviates data scarcity with orders of magnitude concept compositions by using language-free audios; 2) leveraging spectrogram autoencoder to predict the self-supervised audio representation instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation. Moreover, we present its controllability and generalization for X-to-Audio with "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audios given a user-defined modality input. Audio samples are available at https://Text-to-Audio.github.io

翻译：大规模多模态生成式建模已在文生图像和文生视频领域取得了里程碑式进展。然而，其在音频领域的应用仍滞后于前两者，主要原因有二：缺乏具备高质量文本-音频对的大规模数据集，以及长时序连续音频数据建模的复杂性。针对这些不足，本文提出基于提示增强扩散模型的Make-An-Audio技术，通过以下策略实现突破：1) 引入基于蒸馏-重编程方法的伪提示增强机制，利用无语言标注的音频数据，以数量级规模的概念组合缓解数据稀缺问题；2) 采用频谱图自编码器预测自监督音频表征以替代原始波形。结合鲁棒的对比语言-音频预训练（CLAP）表征，Make-An-Audio在客观与主观基准评估中均取得了最优结果。此外，本文首次实现了X到音频生成的跨模态可控性与泛化能力（"不漏任何模态"），突破性地支持根据用户定义的模态输入生成高清高保真音频。音频样本可访问 https://Text-to-Audio.github.io 获取。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

专知会员服务

16+阅读 · 2022年3月13日

50+篇《神经架构搜索NAS》2020论文合集

专知会员服务

61+阅读 · 2020年3月19日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日