Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitions between high-quality 1-step sound generation and even higher-quality multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. We therefore reframe CTM's training framework and introduce a novel feature distance for the distillation loss, computed with the teacher network itself. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves promising 1-step and multi-step real-time sound generation without any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability for training-free controllable sound generation.
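The inference-time interpolation between the conditional and unconditional student models can be read as a classifier-free-guidance-style blend of the two models' outputs. Below is a minimal PyTorch sketch of that idea; the names `student_cond`, `student_uncond`, and the weight `nu` are hypothetical stand-ins, not SoundCTM's actual API.

```python
import torch

@torch.no_grad()
def interpolated_one_step_sample(student_cond, student_uncond, x_T, text_cond, nu=1.5):
    """Blend the conditional and unconditional students' 1-step outputs.

    All interfaces here are assumptions for illustration: each student is
    treated as a callable that jumps from noise x_T directly to a sample.
    """
    x_cond = student_cond(x_T, text_cond)  # conditional student's jump
    x_uncond = student_uncond(x_T)         # unconditional student's jump
    # Linear interpolation/extrapolation between the two outputs:
    # nu = 1 recovers the purely conditional sample, while nu > 1
    # strengthens the text condition, analogous to classifier-free guidance.
    return x_uncond + nu * (x_cond - x_uncond)
```

Under this reading, a single `nu` knob trades off fidelity to the text condition against sample diversity at inference time, without retraining either student.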