Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties with generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to address these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding the complexities of handling 2D spectrogram representations and the need for an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, improving convergence speed, training stability, and memory usage, and making the training process easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-language models for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that achieves strong prompt alignment while preserving high audio quality at larger CFG scores, eliminating the need to search for an optimal CFG score to balance this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
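To make the CFG rescaling idea in point (4) concrete, the following is a minimal sketch of the general rescaling technique for classifier-free guidance, not EzAudio's exact implementation. The function names and the `rescale_weight` parameter are illustrative assumptions: the guided prediction is rescaled so its per-batch standard deviation matches that of the conditional prediction, counteracting the over-saturation that large CFG scores cause, then blended with the unrescaled output.

```python
import numpy as np

def cfg_rescale(cond, uncond, guidance_scale, rescale_weight=0.7):
    """Classifier-free guidance with standard-deviation rescaling.

    cond / uncond: model predictions with and without the text condition.
    guidance_scale: CFG score; larger values sharpen prompt alignment
    but, without rescaling, tend to degrade audio quality.
    rescale_weight: blend factor between rescaled and raw guided output
    (hypothetical default; the paper tunes its own value).
    """
    # Standard classifier-free guidance combination.
    guided = uncond + guidance_scale * (cond - uncond)
    # Rescale so the guided prediction's std matches the conditional one,
    # undoing the variance inflation introduced by a large guidance scale.
    rescaled = guided * (cond.std() / guided.std())
    # Interpolate between the rescaled and the raw guided prediction.
    return rescale_weight * rescaled + (1.0 - rescale_weight) * guided
```

With `rescale_weight=1.0` the output's standard deviation equals that of the conditional prediction exactly; with `rescale_weight=0.0` it reduces to plain CFG. In a diffusion sampler this function would replace the usual CFG combination step at each denoising iteration.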