Timbre transfer aims to modify the timbral identity of a musical recording while preserving its original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines, which propagate source-separation artifacts and produce synthesized timbres that are incoherent across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. By modeling both per-stem content and cross-stem harmonic dependencies, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Evaluations on the SATB choral dataset show that, despite operating under a strictly harder input condition, MixtureTT outperforms single-instrument baselines on both objective and subjective metrics, demonstrating the need for dedicated multi-instrument timbre transfer rather than naive separate-then-transfer pipelines. The joint setting also consistently exceeds an equivalent single-stem ablation, confirming that cross-stem modeling is essential for mixture-level timbre transfer.