This paper addresses timbral ambiguity in instrument timbre transfer under fine-grained structural conditions. We argue this issue stems from instrument-specific expressive details in these conditions, which conflict with the target timbral properties. For example, imposing a violin's pitch-dominant vibrato contours onto a flute, which naturally exhibits loudness-dominant vibrato, impairs timbral fidelity. We propose AdaTT, a target-adaptive system that ensures high timbral fidelity across diverse timbre transfer scenarios within the ControlNet scheme. It selectively scales the frame-wise influence of pitch and loudness controls via text prompts to match the target instrument's identity. We also present a semi-automatic data construction pipeline to teach the model which expressive details to transform or preserve. Results show AdaTT achieves superior timbral fidelity and naturalness while retaining score-level content. Audio samples are available at https://dabinkim0.github.io/adatt/.
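The abstract does not say how the frame-wise scaling is computed, so the following is only a minimal sketch of the general idea, not the AdaTT implementation. It assumes a gate in (0, 1) per frame, obtained from a sigmoid over a linear projection of the control features and a text-prompt embedding. All names (`framewise_gates`, `scale_controls`, the weight keys) and the random placeholder weights are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def framewise_gates(control, text_emb, w_ctrl, w_text, bias=0.0):
    """Return one gate per frame.

    control:  (T, D) frame-wise control features (pitch or loudness)
    text_emb: (E,)   embedding of the target-instrument text prompt
    w_ctrl:   (D,)   hypothetical projection weights for the control
    w_text:   (E,)   hypothetical projection weights for the prompt
    """
    logits = control @ w_ctrl + text_emb @ w_text + bias  # shape (T,)
    return sigmoid(logits)


def scale_controls(pitch, loudness, text_emb, params):
    """Scale pitch and loudness controls frame by frame with separate gates."""
    g_pitch = framewise_gates(
        pitch, text_emb, params["w_pitch"], params["w_text_pitch"]
    )
    g_loud = framewise_gates(
        loudness, text_emb, params["w_loud"], params["w_text_loud"]
    )
    # Broadcast each per-frame gate across the feature dimension.
    return pitch * g_pitch[:, None], loudness * g_loud[:, None], g_pitch, g_loud


# Toy inputs: T frames, D control channels, E text-embedding dimensions.
rng = np.random.default_rng(0)
T, D, E = 100, 8, 16
pitch = rng.standard_normal((T, D))
loudness = rng.standard_normal((T, D))
text_emb = rng.standard_normal(E)

# Placeholder weights; in a trained system these would be learned.
params = {
    "w_pitch": 0.1 * rng.standard_normal(D),
    "w_loud": 0.1 * rng.standard_normal(D),
    "w_text_pitch": 0.1 * rng.standard_normal(E),
    "w_text_loud": 0.1 * rng.standard_normal(E),
}

scaled_pitch, scaled_loud, g_pitch, g_loud = scale_controls(
    pitch, loudness, text_emb, params
)
```

In this sketch, a gate near 0 suppresses a control at that frame and a gate near 1 passes it through unchanged. That is one simple way to let the prompt decide, for example, how much of a source violin's pitch vibrato reaches a flute target.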