Recent advances in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process imposes a substantial computational burden, resulting in slow generation and limiting their deployment in text-to-audio generation. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient, high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, enabling rapid inference by mapping any point at any time step directly to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sampling iterations, we propose Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands of steps to dozens while preserving sample quality, achieving both fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational transformer framework. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audio, while maintaining sample quality competitive with state-of-the-art models that use hundreds of steps. AudioLCM achieves a sampling speed 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio deployment. Our extensive preliminary analyses show that each design choice in AudioLCM is effective.
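The core idea of few-step consistency sampling described above (mapping a noisy point at any time step directly to the trajectory's origin, then optionally re-noising to a smaller time step and repeating) can be illustrated with a minimal toy sketch. This is not the AudioLCM implementation: the consistency function below uses a placeholder linear "network" and standard skip/output scalings, and all names (`consistency_fn`, `lcm_sample`, `sigma_min`, the time schedule values) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def consistency_fn(x_t, t, sigma_min=0.002):
    """Toy consistency function f(x_t, t): estimates the trajectory's
    initial point x_0 from a noisy latent x_t at time t.

    A placeholder (-x_t) stands in for the distilled network's output;
    the c_skip / c_out scalings enforce the boundary condition
    f(x, sigma_min) ~= x, as in consistency-model parameterizations.
    """
    c_skip = sigma_min**2 / (t**2 + sigma_min**2)
    c_out = t / np.sqrt(t**2 + sigma_min**2)
    model_out = -x_t  # placeholder for a trained denoising network
    return c_skip * x_t + c_out * model_out

def lcm_sample(shape, timesteps, rng):
    """Few-step sampling loop: start from pure noise, jump to an x_0
    estimate in one call, then re-noise to the next (smaller) time step
    and repeat. len(timesteps) == number of network evaluations."""
    x = rng.standard_normal(shape) * timesteps[0]
    for i, t in enumerate(timesteps):
        x0 = consistency_fn(x, t)
        if i + 1 < len(timesteps):
            t_next = timesteps[i + 1]
            # Re-noise the x_0 estimate to the next time step.
            x = x0 + t_next * rng.standard_normal(shape)
        else:
            x = x0
    return x

rng = np.random.default_rng(0)
# Two entries in the schedule = the 2-iteration regime reported above.
sample = lcm_sample((4,), timesteps=[80.0, 10.0], rng=rng)
```

In a real latent text-to-audio system, `consistency_fn` would be the distilled, text-conditioned network operating on spectrogram latents, and the output would still pass through a latent decoder and vocoder; the loop structure, however, is what makes 2-step generation possible.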