We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training. It has two components: (i) dimension-wise noise injection that targets the latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input's melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text or audio conditioning (e.g., CLAP). We discuss design choices, analyze the trade-off between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases.
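To make the two controls concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes a latent stored as a list of channels, a hypothetical per-channel weight in [0, 1] where higher means "more informative of timbre", and a hypothetical `denoise_step` callable standing in for one reverse-diffusion step of the pre-trained model. The additive noise formula and the linear blend used for clamping are illustrative assumptions as well.

```python
import random
from typing import Callable, List

Latent = List[List[float]]  # indexed as latent[channel][frame]


def inject_noise(latent: Latent, weights: List[float], sigma: float,
                 rng: random.Random) -> Latent:
    """Add Gaussian noise to each channel, scaled by that channel's weight.

    Assumption: a weight near 1 marks a timbre-informative channel, so it
    receives more noise; a weight of 0 leaves the channel unchanged.
    """
    return [
        [value + sigma * w * rng.gauss(0.0, 1.0) for value in channel]
        for channel, w in zip(latent, weights)
    ]


def clamp_to_source(current: Latent, source: Latent,
                    weights: List[float]) -> Latent:
    """Blend low-weight (structure-carrying) channels back toward the source.

    Assumption: a linear blend w * current + (1 - w) * source per channel.
    """
    return [
        [w * c + (1.0 - w) * s for c, s in zip(cur_ch, src_ch)]
        for cur_ch, src_ch, w in zip(current, source, weights)
    ]


def edit_latent(source: Latent, weights: List[float],
                denoise_step: Callable[[Latent, int], Latent],
                num_steps: int, clamp_steps: int, sigma: float,
                seed: int = 0) -> Latent:
    """Noise the source latent channel-wise, then run the reverse process,
    clamping to the source only during the first `clamp_steps` steps."""
    rng = random.Random(seed)
    x = inject_noise(source, weights, sigma, rng)
    for step in range(num_steps):
        x = denoise_step(x, step)  # hypothetical model call
        if step < clamp_steps:
            x = clamp_to_source(x, source, weights)
    return x
```

In this sketch, `sigma` and `clamp_steps` are the two knobs behind the trade-off: a larger `sigma` allows more timbral change, while a larger `clamp_steps` holds the output closer to the input's structure.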