Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in the latent domain and rely on cascaded phase recovery modules to reconstruct the waveform, which poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, EDMSound achieved a Fréchet audio distance (FAD) score similar to that of the top-ranked baseline with only 10 steps and reached state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also revealed a potential concern regarding diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data. Project page: https://agentcooper2002.github.io/EDMSound/
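For readers unfamiliar with the FAD metric mentioned above, the following is a minimal sketch of the Fréchet distance between two Gaussians fitted to sets of audio embeddings. It is an illustration under stated assumptions, not the evaluation code used for the reported results: the function name `frechet_distance` is hypothetical, and the embeddings are assumed to have been extracted beforehand by some pretrained audio model (that step is not shown).

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (num_clips, embedding_dim).
    Hypothetical helper for illustration; the embedding extractor is assumed.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices. Clip to guard against small negative round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

A lower value indicates that the generated-audio embeddings are statistically closer to the reference embeddings; two identical embedding sets give a distance of approximately zero.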