Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties with generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to address these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding the complexities of handling 2D spectrogram representations and the need for an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, improving convergence speed, training stability, and memory usage, and making the training process easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-language models for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that achieves strong prompt alignment while preserving high audio quality at larger CFG scores, eliminating the need to search for an optimal CFG score to balance this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
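To make the CFG rescaling idea in point (4) concrete, the following is a minimal sketch of the general rescaling technique for classifier-free guidance, not EzAudio's exact implementation. The function names and the `rescale_weight` parameter are illustrative assumptions: the guided prediction is rescaled so its per-batch standard deviation matches that of the conditional prediction, counteracting the over-saturation that large CFG scores cause, then blended with the unrescaled output.

```python
import numpy as np

def cfg_rescale(cond, uncond, guidance_scale, rescale_weight=0.7):
    """Classifier-free guidance with standard-deviation rescaling.

    cond / uncond: model predictions with and without the text condition.
    guidance_scale: CFG score; larger values sharpen prompt alignment
    but, without rescaling, tend to degrade audio quality.
    rescale_weight: blend factor between rescaled and raw guided output
    (hypothetical default; the paper tunes its own value).
    """
    # Standard classifier-free guidance combination.
    guided = uncond + guidance_scale * (cond - uncond)
    # Rescale so the guided prediction's std matches the conditional one,
    # undoing the variance inflation introduced by a large guidance scale.
    rescaled = guided * (cond.std() / guided.std())
    # Interpolate between the rescaled and the raw guided prediction.
    return rescale_weight * rescaled + (1.0 - rescale_weight) * guided
```

With `rescale_weight=1.0` the output's standard deviation equals that of the conditional prediction exactly; with `rescale_weight=0.0` it reduces to plain CFG. In a diffusion sampler this function would replace the usual CFG combination step at each denoising iteration.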