Recent advances in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process imposes a substantial computational burden, resulting in slow generation and limiting their deployment in text-to-audio generation. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient, high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, enabling rapid inference by mapping any point at any time step directly to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sampling iterations, we propose Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands of steps to dozens while preserving sample quality, achieving both fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational transformer framework. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audio, while maintaining sample quality competitive with state-of-the-art models that use hundreds of steps. AudioLCM achieves a sampling speed 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio deployment. Our extensive preliminary analyses show that each design choice in AudioLCM is effective.
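The core idea of few-step consistency sampling described above (mapping a noisy point at any time step directly to the trajectory's origin, then optionally re-noising to a smaller time step and repeating) can be illustrated with a minimal toy sketch. This is not the AudioLCM implementation: the consistency function below uses a placeholder linear "network" and standard skip/output scalings, and all names (`consistency_fn`, `lcm_sample`, `sigma_min`, the time schedule values) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def consistency_fn(x_t, t, sigma_min=0.002):
    """Toy consistency function f(x_t, t): estimates the trajectory's
    initial point x_0 from a noisy latent x_t at time t.

    A placeholder (-x_t) stands in for the distilled network's output;
    the c_skip / c_out scalings enforce the boundary condition
    f(x, sigma_min) ~= x, as in consistency-model parameterizations.
    """
    c_skip = sigma_min**2 / (t**2 + sigma_min**2)
    c_out = t / np.sqrt(t**2 + sigma_min**2)
    model_out = -x_t  # placeholder for a trained denoising network
    return c_skip * x_t + c_out * model_out

def lcm_sample(shape, timesteps, rng):
    """Few-step sampling loop: start from pure noise, jump to an x_0
    estimate in one call, then re-noise to the next (smaller) time step
    and repeat. len(timesteps) == number of network evaluations."""
    x = rng.standard_normal(shape) * timesteps[0]
    for i, t in enumerate(timesteps):
        x0 = consistency_fn(x, t)
        if i + 1 < len(timesteps):
            t_next = timesteps[i + 1]
            # Re-noise the x_0 estimate to the next time step.
            x = x0 + t_next * rng.standard_normal(shape)
        else:
            x = x0
    return x

rng = np.random.default_rng(0)
# Two entries in the schedule = the 2-iteration regime reported above.
sample = lcm_sample((4,), timesteps=[80.0, 10.0], rng=rng)
```

In a real latent text-to-audio system, `consistency_fn` would be the distilled, text-conditioned network operating on spectrogram latents, and the output would still pass through a latent decoder and vocoder; the loop structure, however, is what makes 2-step generation possible.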