DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Controllable music generation methods are critical for human-centered, AI-based music creation, but are currently limited by trade-offs among speed, quality, and control design. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (DITTO-2), a new method that speeds up inference-time optimization-based control and unlocks faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find that our method not only speeds up generation by over 10-20x, but also improves control adherence and generation quality at the same time. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show that we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
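
To make steps (2) and (3) concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical stand-in denoiser (`denoise`, an elementwise tanh) in place of a real distilled diffusion model, a hypothetical control target (the mean output level), and a hand-derived gradient in place of automatic differentiation. It only illustrates the control flow: optimize the initial noise latent through one-step sampling, then decode the optimized latent with a multi-step sampler.

```python
import math
import random

def denoise(x, sigma):
    """Hypothetical stand-in for a distilled model: noisy latent -> clean estimate."""
    return [math.tanh(v / sigma) for v in x]

def control_loss(x0, target):
    """Toy control objective: squared error between mean output level and target."""
    mean = sum(x0) / len(x0)
    return (mean - target) ** 2

def optimize_noise(x_T, target, steps=200, lr=5.0):
    """Step (2): gradient descent on the initial noise via ONE-step sampling."""
    x_T = list(x_T)
    n = len(x_T)
    for _ in range(steps):
        x0 = denoise(x_T, 1.0)  # cheap one-step surrogate sample
        mean = sum(x0) / n
        # Analytic gradient of the toy loss w.r.t. each noise entry:
        # d loss / d x_T[i] = 2 * (mean - target) * (1 - tanh(x_T[i])^2) / n
        grad = [2.0 * (mean - target) * (1.0 - y * y) / n for y in x0]
        x_T = [v - lr * g for v, g in zip(x_T, grad)]
    return x_T

def multi_step_decode(x_T, sigmas=(1.0, 0.5), seed=0):
    """Step (3): toy multi-step sampling that alternates denoising and re-noising."""
    rng = random.Random(seed)
    x0 = denoise(x_T, sigmas[0])
    for sigma in sigmas[1:]:
        x_t = [v + sigma * rng.gauss(0.0, 1.0) for v in x0]  # re-noise
        x0 = denoise(x_t, sigma)                              # denoise again
    return x0

if __name__ == "__main__":
    rng = random.Random(1)
    noise = [rng.gauss(0.0, 1.0) for _ in range(16)]
    optimized = optimize_noise(noise, target=0.3)
    print("loss before:", control_loss(denoise(noise, 1.0), 0.3))
    print("loss after: ", control_loss(denoise(optimized, 1.0), 0.3))
    print("decoded:    ", multi_step_decode(optimized)[:4])
```

In this toy, the one-step loss is what gets minimized; how well the multi-step output preserves the control depends on the sampler, which is the property the evaluation described above measures for the real system.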
