Latent diffusion models have emerged as the dominant paradigm for many generation tasks, including audio generation tasks such as text-to-audio, text-to-music, and text-to-speech. A key component of latent diffusion is a variational autoencoder (VAE) that compresses high-dimensional signals into a continuous, low-frame-rate representation that is conducive to downstream prediction. Regularizing these VAEs is challenging because there is a trade-off between over-regularized latent representations (poor output quality) and under-regularized ones (difficult to predict). We propose a framework for studying this trade-off through the lens of compression and train audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison with well-studied discrete neural audio codec models and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates helps identify the optimal generation setting.
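To make the link between a KL target and a bitrate concrete, the following is a minimal sketch, not the authors' implementation. It assumes a diagonal-Gaussian posterior with a standard-normal prior, treats the per-frame KL (in nats) as the rate of the latent, and uses an absolute-deviation penalty as one possible form of target-KL regularization. The abstract does not specify the loss form, and all function names and the example frame rate are hypothetical.

```python
import math


def gaussian_kl_nats(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) in nats, summed over the latent
    dimensions of a single frame (diagonal Gaussian assumed)."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, logvar))


def kl_nats_to_kbps(kl_nats_per_frame, frame_rate_hz):
    """Interpret the per-frame KL as a rate: nats -> bits -> kbit/s."""
    bits_per_frame = kl_nats_per_frame / math.log(2)
    return bits_per_frame * frame_rate_hz / 1000.0


def kbps_to_kl_nats(kbps, frame_rate_hz):
    """Inverse mapping: the per-frame KL (nats) that matches a bitrate."""
    return kbps * 1000.0 / frame_rate_hz * math.log(2)


def target_kl_penalty(kl_nats_per_frame, target_nats_per_frame):
    """Assumed penalty form: absolute deviation from the KL target, so the
    regularizer pushes the KL toward the target from either side."""
    return abs(kl_nats_per_frame - target_nats_per_frame)


# Hypothetical usage: aim for 2 kbit/s at an assumed latent frame rate of 25 Hz.
target = kbps_to_kl_nats(2.0, 25.0)
kl = gaussian_kl_nats(mu=[0.5, -1.0, 2.0], logvar=[-1.0, -2.0, -0.5])
penalty = target_kl_penalty(kl, target)
```

Under these assumptions, sweeping the target bitrate and recording reconstruction error at each setting would yield one point per model on a rate-distortion curve, which is what makes the comparison with discrete codecs operating at known bitrates possible.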