Electronic synthesizer sounds are controlled by parameter settings that yield complex timbral characteristics and ADSR envelopes, making synthesizer-style audio transfer particularly challenging. Recent approaches to timbre transfer often rely on spectral objectives or implicit style matching, offering limited control over envelope shaping. Moreover, public synthesizer datasets rarely provide diverse coverage of timbres and ADSR envelopes. To address these gaps, we present SynthCloner, a factorized codec model that disentangles audio into three attributes: ADSR envelope, timbre, and content. This separation enables expressive audio transfer with independent control over these attributes. Additionally, we introduce SynthCAT, a new synthesizer dataset with a task-specific rendering pipeline covering 250 timbres, 120 ADSR envelopes, and 100 MIDI sequences. Experiments show that SynthCloner outperforms baselines on both objective and subjective metrics, while enabling independent attribute control. The code, model checkpoint, and audio examples are available at https://buffett0323.github.io/synthcloner/.