Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency cepstral coefficients (MFCCs)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although these methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or to generate signals at relatively low sampling rates. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.