Zero-shot editing of signals using large pre-trained models has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, both of which use DDPM inversion on pre-trained diffusion models. The first, adopted from the image domain, allows text-based editing. The second is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody. Samples and code can be found on our examples page at https://hilamanor.github.io/AudioEditing/.