Advances in diffusion-based text-to-music generation have opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering a target stem while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production requires precise manipulation of individual components within dense mixtures. Our key finding is that, while vanilla cross-attention captures the semantic features of stems, it lacks the spectral resolution to localize targets precisely in dense mixtures, leading to boundary leakage. To address this limitation, we propose Polyphonia, a zero-shot editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling precise semantic synthesis while preserving non-target stems. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 timbre transfer tasks in polyphonic music. Experiments show that Polyphonia achieves a 15.5% increase in target alignment over baselines while maintaining competitive music fidelity and non-target integrity.