Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitions between high-quality 1-step sound generation and even higher-quality multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. We therefore reframe CTM's training framework and introduce a novel feature distance for the distillation loss, computed with the teacher network itself. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves promising 1-step and multi-step real-time sound generation without any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability for training-free controllable sound generation.
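The inference-time interpolation between the conditional and unconditional student models can be read as a classifier-free-guidance-style blend of the two models' outputs. Below is a minimal PyTorch sketch of that idea; the names `student_cond`, `student_uncond`, and the weight `nu` are hypothetical stand-ins, not SoundCTM's actual API.

```python
import torch

@torch.no_grad()
def interpolated_one_step_sample(student_cond, student_uncond, x_T, text_cond, nu=1.5):
    """Blend the conditional and unconditional students' 1-step outputs.

    All interfaces here are assumptions for illustration: each student is
    treated as a callable that jumps from noise x_T directly to a sample.
    """
    x_cond = student_cond(x_T, text_cond)  # conditional student's jump
    x_uncond = student_uncond(x_T)         # unconditional student's jump
    # Linear interpolation/extrapolation between the two outputs:
    # nu = 1 recovers the purely conditional sample, while nu > 1
    # strengthens the text condition, analogous to classifier-free guidance.
    return x_uncond + nu * (x_cond - x_uncond)
```

Under this reading, a single `nu` knob trades off fidelity to the text condition against sample diversity at inference time, without retraining either student.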