With the development of audio playback devices and fast data transmission, the demand for high sound quality is rising in both entertainment and communications. In this quest for better sound quality, challenges arise from distortions and interferences originating at the recording side or caused by an imperfect transmission pipeline. To address this problem, audio restoration methods aim to recover clean sound signals from corrupted input data. We present audio restoration algorithms based on diffusion models, with a focus on speech enhancement and music restoration tasks. Traditional approaches, often grounded in handcrafted rules and statistical heuristics, have shaped our understanding of audio signals. In recent decades, there has been a notable shift towards data-driven methods that exploit the modeling capabilities of deep neural networks (DNNs). Deep generative models, and among them diffusion models, have emerged as powerful techniques for learning complex data distributions. However, relying solely on DNN-based learning carries the risk of reduced interpretability, particularly when employing end-to-end models. Nonetheless, data-driven approaches allow more flexibility than statistical model-based frameworks, whose performance depends on distributional and statistical assumptions that can be difficult to guarantee. Here, we aim to show that diffusion models can combine the best of both worlds and offer the opportunity to design audio restoration algorithms with a good degree of interpretability and remarkable performance in terms of sound quality. We explain the diffusion formalism and its application to the conditional generation of clean audio signals. We believe that diffusion models open an exciting field of research with the potential to spawn new audio restoration algorithms that sound natural and remain robust in difficult acoustic conditions.
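To make the diffusion formalism mentioned above concrete, the following is a minimal sketch of a DDPM-style forward (noising) process and one reverse (denoising) step on a toy 1-D audio signal. The variance schedule, step count, and the "oracle" noise estimate are illustrative assumptions; in an actual restoration system, a trained DNN (possibly conditioned on the corrupted observation) would supply the noise estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear variance schedule (values chosen for the sketch).
T = 50                                    # number of diffusion steps
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # cumulative products \bar{alpha}_t

def forward_diffuse(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of x0."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def reverse_step(x_t, t, eps_hat, rng):
    """One DDPM reverse step given a noise estimate eps_hat for step t."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

# Toy "clean audio": a short sine wave.
x0 = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 256))
eps = rng.standard_normal(x0.shape)
x_T = forward_diffuse(x0, T - 1, eps)     # heavily corrupted signal

# Run the reverse chain with an oracle noise estimate (illustration only;
# a trained, possibly conditioned, DNN would predict eps_hat in practice).
x = x_T
for t in reversed(range(T)):
    eps_hat = eps
    x = reverse_step(x, t, eps_hat, rng)
```

The sketch shows the two ingredients the formalism rests on: a fixed Gaussian corruption process indexed by a variance schedule, and a learned reverse process that inverts it step by step. Conditioning the noise estimator on a degraded recording is what turns this generative machinery into a restoration algorithm.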