Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers that reduces both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden-state variance. Finally, we combine our step and layer distillation methods into a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yields best-in-class performance. Our combined distillation method generates high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32-second mono/stereo 44.1kHz audio, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.