Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitions between high-quality 1-step sound generation and even higher-quality multi-step generation, allowing creators to first shape sounds with 1-step samples and then refine them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance, computed with the teacher network, for the distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between them during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without any extra off-the-shelf networks, and we further demonstrate its capability for controllable sound generation in a training-free manner. Our code, pretrained models, and audio samples are available at https://github.com/sony/soundctm.
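The conditional/unconditional interpolation at inference can be pictured as a classifier-free-guidance-style blend of two student outputs. Below is a minimal Python sketch, assuming the distilled student exposes a jump `student(x_t, t, s, cond)` from time t to time s; the helper name `guided_jump`, the `None`-as-unconditional convention, and the weight `nu` are illustrative assumptions, not the paper's exact API.

```python
def guided_jump(student, x_t, t, s, cond, nu=1.5):
    """Hypothetical sketch of interpolating the conditional and
    unconditional student models at inference.

    x_t : noisy latent at time t
    t, s: current and target times of the student's trajectory jump
    cond: conditioning (e.g., a text embedding); None selects the
          unconditional branch (an assumed convention)
    nu  : interpolation weight; nu=1 recovers the conditional model
    """
    x_cond = student(x_t, t, s, cond)    # conditional jump from t to s
    x_uncond = student(x_t, t, s, None)  # unconditional jump from t to s
    # Linear interpolation between the two student trajectories,
    # in the spirit of classifier-free guidance.
    return nu * x_cond + (1.0 - nu) * x_uncond
```

Because both branches come from the same distilled student, sweeping `nu` trades off condition adherence against diversity without retraining, in either 1-step or multi-step sampling.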