Editing signals using large pre-trained models, in a zero-shot manner, has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, which use DDPM inversion with pre-trained diffusion models. The first, which we coin ZEro-shot Text-based Audio (ZETA) editing, is adapted from the image domain. The second, named ZEro-shot UnSupervized (ZEUS) editing, is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody. Samples and code can be found at https://hilamanor.github.io/AudioEditing/ .