Diffusion models power the vast majority of text-to-audio (TTA) generation methods. Unfortunately, these models suffer from slow inference because they query the underlying denoising network iteratively, which makes them unsuitable for scenarios with inference-time or computational constraints. This work modifies the recently proposed consistency distillation framework to train TTA models that require only a single neural network query. In addition to incorporating classifier-free guidance into the distillation process, we leverage the availability of generated audio during distillation training to fine-tune the consistency TTA model with novel loss functions in the audio space, such as the CLAP score. Our objective and subjective evaluation results on the AudioCaps dataset show that consistency models retain diffusion models' high generation quality and diversity while reducing the number of queries by a factor of 400.
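To make the query-count argument concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a guided multi-step sampler, which queries the network twice per step under classifier-free guidance, with a single-query generator. The names `CountingDenoiser`, `cfg_combine`, `guided_multistep`, and `single_query`, the toy update rule, the guidance weight, and the step count of 200 are all illustrative assumptions; the abstract does not state the sampler settings.

```python
import numpy as np


class CountingDenoiser:
    """Toy stand-in for a denoising network that counts how often it is queried.

    Hypothetical helper for illustration only; it is not a trained model.
    """

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t, cond):
        self.calls += 1
        # Placeholder prediction: a conditional query gets a small constant shift.
        shift = 0.0 if cond is None else 0.1
        return 0.5 * x + shift


def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one with guidance weight w."""
    return eps_uncond + w * (eps_cond - eps_uncond)


def guided_multistep(net, x, cond, steps, w):
    """Iterative sampling with guidance: two network queries per step."""
    for t in range(steps, 0, -1):
        eps_c = net(x, t, cond)   # conditional query
        eps_u = net(x, t, None)   # unconditional query
        # Toy update rule (assumption): a real sampler follows a noise schedule.
        x = x - cfg_combine(eps_c, eps_u, w) / steps
    return x


def single_query(net, x, cond):
    """Single-query generation: exactly one network call."""
    return net(x, 1, cond)


if __name__ == "__main__":
    noise = np.zeros(4)

    diffusion_net = CountingDenoiser()
    guided_multistep(diffusion_net, noise, "a dog barking", steps=200, w=3.0)

    consistency_net = CountingDenoiser()
    single_query(consistency_net, noise, "a dog barking")

    print(diffusion_net.calls, consistency_net.calls)
```

Under these assumed settings, the guided sampler issues two queries per step while the single-query generator issues one in total. A step count of 200 is chosen here only because it reproduces a ratio of 400; the actual baseline configuration should be taken from the paper itself.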