We introduce Mix2Morph, a text-to-audio diffusion model fine-tuned to perform sound morphing without a dedicated dataset of morphs. By fine-tuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integrate qualities of both sources. We specifically target sound infusions, a practically and perceptually motivated subclass of morphing in which one sound acts as the dominant primary source, providing the overall temporal and structural behavior, while a secondary sound is infused throughout, enriching its timbral and textural qualities. Objective evaluations and listening tests show that Mix2Morph outperforms prior baselines and produces high-quality sound infusions across diverse categories, representing a step toward more controllable and concept-driven tools for sound design. Sound examples are available at https://anniejchu.github.io/mix2morph.
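The surrogate-mix idea described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes a simple amplitude-domain mix of two waveforms and a standard DDPM-style forward noising process with a linear beta schedule, restricted to high timesteps where the mix's fine-grained artifacts are masked by noise. All function names and parameter values here are assumptions for illustration.

```python
import numpy as np

def surrogate_mix(x_a: np.ndarray, x_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Amplitude-domain mix of two equal-length waveforms (illustrative surrogate)."""
    return alpha * x_a + (1.0 - alpha) * x_b

def noise_at_timestep(x0: np.ndarray, t: int, T: int = 1000, seed: int = 0) -> np.ndarray:
    """DDPM-style forward noising q(x_t | x_0) with a linear beta schedule (assumed)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Toy stand-ins for the two source sounds: 1 s sinusoids at 16 kHz.
sr = 16000
t_axis = np.arange(sr) / sr
x_a = np.sin(2 * np.pi * 220 * t_axis)   # "primary" source (illustrative)
x_b = np.sin(2 * np.pi * 330 * t_axis)   # "secondary" source (illustrative)

# Fine-tuning targets would be drawn only at high timesteps (e.g. t >= 600 here),
# so the model never sees the artifact-prone clean mix directly.
x_mix = surrogate_mix(x_a, x_b)
x_t = noise_at_timestep(x_mix, t=800)
```

At such a high timestep, almost all of the signal energy in `x_t` comes from the injected Gaussian noise, which is the point of training against noisy surrogates rather than the clean mix itself.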