Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be fine-tuned closed-loop with audio-space text-aware metrics, such as the CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that, compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
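To make the speedup concrete, the toy sketch below contrasts the two sampling regimes the abstract describes: a diffusion sampler that queries the denoising network once per step, versus a consistency model that maps a noisy latent to a clean one in a single query. It also includes the standard classifier-free guidance combination that the "CFG-aware" training bakes into the student network. All function names and shapes here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance: extrapolate the conditional
    prediction away from the unconditional one by guidance weight w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def diffusion_sample(denoise, z, steps):
    """Multi-step diffusion sampling: one denoiser query per step.
    Returns the final latent and the total query count (the bottleneck)."""
    queries = 0
    for _ in range(steps):
        z = denoise(z)
        queries += 1
    return z, queries

def consistency_sample(f, z):
    """Consistency-model sampling: a single network query maps the
    noisy latent directly to the clean latent."""
    return f(z), 1
```

With a typical multi-hundred-step diffusion sampler, the query-count gap between the two functions above is exactly the source of the reported 400x inference reduction; the CFG-aware training means `cfg_combine` no longer needs to be evaluated at sampling time, since its effect is distilled into the single-query model.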