Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitions between high-quality 1-step sound generation and even higher-quality multi-step generation, allowing creators to first shape sounds with 1-step samples and then refine them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance, computed with the teacher network, for the distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between them during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without any extra off-the-shelf networks, and we further demonstrate its capability for controllable sound generation in a training-free manner. Our code, pretrained models, and audio samples are available at https://github.com/sony/soundctm.
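The conditional/unconditional interpolation at inference can be pictured as a classifier-free-guidance-style blend of two student outputs. Below is a minimal Python sketch, assuming the distilled student exposes a jump `student(x_t, t, s, cond)` from time t to time s; the helper name `guided_jump`, the `None`-as-unconditional convention, and the weight `nu` are illustrative assumptions, not the paper's exact API.

```python
def guided_jump(student, x_t, t, s, cond, nu=1.5):
    """Hypothetical sketch of interpolating the conditional and
    unconditional student models at inference.

    x_t : noisy latent at time t
    t, s: current and target times of the student's trajectory jump
    cond: conditioning (e.g., a text embedding); None selects the
          unconditional branch (an assumed convention)
    nu  : interpolation weight; nu=1 recovers the conditional model
    """
    x_cond = student(x_t, t, s, cond)    # conditional jump from t to s
    x_uncond = student(x_t, t, s, None)  # unconditional jump from t to s
    # Linear interpolation between the two student trajectories,
    # in the spirit of classifier-free guidance.
    return nu * x_cond + (1.0 - nu) * x_uncond
```

Because both branches come from the same distilled student, sweeping `nu` trades off condition adherence against diversity without retraining, in either 1-step or multi-step sampling.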