Diffusion models have emerged as powerful deep generative techniques, producing high-quality and diverse samples across many domains, including audio. While existing reviews provide broad overviews, in-depth discussion of the specific design choices behind diffusion models remains limited. The audio diffusion literature also lacks principled guidance on how to implement these design choices and how they compare across applications. This survey provides a comprehensive review of diffusion model design, with an emphasis on design principles for quality improvement and on conditioning for audio applications. We adopt the score modeling perspective as a unifying framework that accommodates various interpretations, including recent approaches such as flow matching. We systematically examine the training and sampling procedures of diffusion models, as well as audio applications realized through different conditioning mechanisms. To promote reproducible research and rapid prototyping, we introduce an open-source codebase (https://github.com/gzhu06/AudioDiffuser) that implements the reviewed framework for various audio applications. We demonstrate its capabilities through three case studies: audio generation, speech enhancement, and text-to-speech synthesis, with benchmark evaluations on standard datasets.