We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable, time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketch-like sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of the input controls from a vocal imitation while retaining adherence to the input text prompt and audio quality comparable to a text-only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation. Sound examples are available at https://hugofloresgarcia.art/sketch2sound/.
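The random median filtering described above can be sketched as follows. This is a minimal, NumPy-only illustration, not the paper's implementation: the function name, the set of candidate kernel sizes, and the edge-padding strategy are all assumptions made for the example. The idea is that filtering a frame-wise control signal (e.g. loudness) with a randomly chosen kernel size during training exposes the model to controls at varying levels of temporal detail, so at inference time it can follow either precise or rough, sketch-like control curves.

```python
import numpy as np

def random_median_filter(control: np.ndarray,
                         kernel_sizes=(1, 5, 11, 23),
                         rng=None) -> np.ndarray:
    """Median-filter a 1-D control signal with a randomly chosen odd kernel.

    Larger kernels smooth away fine temporal detail, leaving only the
    coarse "gist" of the control curve; a kernel of 1 leaves the signal
    unchanged. Kernel sizes here are illustrative, not the paper's values.
    """
    rng = rng or np.random.default_rng()
    k = int(rng.choice(kernel_sizes))
    if k <= 1:
        return control.copy()
    pad = k // 2
    # edge-pad so the filtered signal keeps the same length as the input
    padded = np.pad(control, pad, mode="edge")
    # all length-k sliding windows over the padded signal
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)
    return np.median(windows, axis=1)
```

With a fixed kernel size of 3, an isolated one-frame spike in the control signal is removed while the overall envelope is preserved, which is exactly the kind of temporal coarsening the training augmentation relies on.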