Music style transfer blends source structure with reference style to enable personalized music creation. However, existing zero-shot methods often struggle to capture fine-grained audio nuances, either relying on coarse text descriptions or requiring expensive task-specific training. We propose Stylus, a training-free framework that repurposes pretrained image diffusion models for music style transfer in the Mel-spectrogram domain. Treating audio as structured time-frequency images, Stylus manipulates self-attention by injecting style keys and values while preserving the source's structural queries. To ensure high fidelity, we introduce a phase-preserving reconstruction strategy that mitigates spectrogram inversion artifacts, alongside a classifier-free-guidance-inspired control for adjustable stylization. Extensive evaluations, including 2,925 human ratings, demonstrate that Stylus outperforms state-of-the-art baselines, achieving 34.1% higher content preservation and 25.7% better perceptual quality. Our work validates that generic image priors can be effectively leveraged for the training-free transformation of structured Mel-spectrograms. Code and materials are available at https://github.com/Sooyyoungg/Stylus.git.
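The attention manipulation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `guidance` parameter, and in particular the blending formula (an extrapolation modeled on classifier-free guidance) are assumptions made for illustration, since the abstract does not give the exact formulation. The sketch keeps the source queries, swaps in style keys and values, and uses one scalar to adjust stylization strength.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention.

    q: (n_query, d), k: (n_key, d), v: (n_key, d_v) -> (n_query, d_v)
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ k.T) * scale, axis=-1)
    return weights @ v


def style_injected_attention(q_src, k_src, v_src, k_sty, v_sty, guidance=1.0):
    """Attend with source queries over style keys/values.

    ASSUMPTION: the blend below is a hypothetical, CFG-like rule and not
    necessarily the formula used by Stylus.
      guidance = 0 -> plain source self-attention (no stylization)
      guidance = 1 -> full style key/value injection
      guidance > 1 -> extrapolates past the style-injected output
    """
    plain = attention(q_src, k_src, v_src)   # source Q, source K/V
    styled = attention(q_src, k_sty, v_sty)  # source Q, style K/V
    return plain + guidance * (styled - plain)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy token features standing in for Mel-spectrogram patches.
    q_src = rng.standard_normal((6, 8))
    k_src = rng.standard_normal((6, 8))
    v_src = rng.standard_normal((6, 8))
    k_sty = rng.standard_normal((10, 8))  # style clip may have a different length
    v_sty = rng.standard_normal((10, 8))

    out = style_injected_attention(q_src, k_src, v_src, k_sty, v_sty, guidance=1.5)
    print(out.shape)
```

Because the queries come from the source, the output keeps one row per source token regardless of how many style tokens are supplied, which is one way to read the abstract's claim that source structure is preserved while style is injected.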