Diffusion models have enabled text-guided image style transfer with high-quality, controllable synthesis. Using text to guide diverse music style transfer, however, remains challenging, primarily because matched audio-text datasets are scarce. Music is an abstract and complex art form that varies in intricate ways even within a single genre, which makes accurate textual description difficult. This paper presents a music style transfer approach that effectively captures musical attributes from minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we propose a bias-reduced stylization technique to obtain stable results. Experimental results demonstrate that our method can transfer the style of specific instruments and can incorporate natural sounds to compose melodies. Samples and source code are available at https://lsfhuihuiff.github.io/MusicTI/.