Large diffusion models have been successful in text-to-audio (T2A) synthesis, but they often suffer from semantic misalignment and poor temporal consistency because of limited natural language understanding and data scarcity. In addition, the 2D spatial structures widely used in T2A work yield unsatisfactory audio quality for variable-length samples, since they do not adequately prioritize temporal information. To address these challenges, we propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on Make-an-Audio. Our approach uses several techniques to improve semantic alignment and temporal consistency. First, we use pre-trained large language models (LLMs) to parse the text into structured <event & order> pairs that better capture temporal information. Second, we introduce a separate structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. Third, to improve variable-length generation and strengthen temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets, alleviating the scarcity of temporal data. Extensive experiments show that our method outperforms baseline models on both objective and subjective metrics and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.
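To make the structured <event & order> representation concrete, the following is a minimal sketch of how such pairs might be held in memory and serialized to text. It is not the authors' implementation: the class `EventOrder`, the helper functions, the order vocabulary in `ORDER_LABELS`, and the space separator between pairs are all assumptions made for illustration, since the abstract specifies only the <event & order> form. In the described method an LLM produces the pairs from a natural-language caption; that step is left out here.

```python
import re
from dataclasses import dataclass

# Hypothetical order vocabulary; the abstract does not list the actual labels.
ORDER_LABELS = ("start", "mid", "end", "all")


@dataclass(frozen=True)
class EventOrder:
    """One sound event together with its coarse temporal position."""
    event: str
    order: str

    def __post_init__(self):
        # Reject labels outside the assumed vocabulary.
        if self.order not in ORDER_LABELS:
            raise ValueError(f"unknown order label: {self.order!r}")


def to_structured_caption(pairs):
    """Serialize pairs as '<event & order>' tokens (space separator is an assumption)."""
    return " ".join(f"<{p.event} & {p.order}>" for p in pairs)


def parse_structured_caption(text):
    """Recover EventOrder pairs from a string of '<event & order>' tokens."""
    pairs = []
    for match in re.finditer(r"<([^<>&]+)&([^<>&]+)>", text):
        pairs.append(EventOrder(match.group(1).strip(), match.group(2).strip()))
    return pairs


if __name__ == "__main__":
    pairs = [EventOrder("dog barking", "start"), EventOrder("thunder", "end")]
    caption = to_structured_caption(pairs)
    print(caption)
    print(parse_structured_caption(caption) == pairs)
```

Keeping the order label as an explicit field, rather than leaving it implicit in free-form text, is what would let a structured-text encoder see temporal position directly.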