Diffsound: Discrete Diffusion Model for Text-to-sound Generation

Generating the sound effects that people want is an important problem, yet sound generation has received little study. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework consisting of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses the decoder, with the help of the VQ-VAE, to map the text features extracted by the text encoder to a mel-spectrogram; the vocoder then transforms the generated mel-spectrogram into a waveform. We found that the decoder significantly influences generation performance, so this study focuses on designing a good decoder. We begin with the traditional autoregressive (AR) decoder, which has been shown to be a state-of-the-art method in previous sound generation work. However, the AR decoder always predicts the mel-spectrogram tokens one by one in order, which introduces unidirectional bias and error accumulation. Moreover, with the AR decoder, generation time increases linearly with sound duration. To overcome these shortcomings, we propose Diffsound, a non-autoregressive decoder based on the discrete diffusion model. Specifically, Diffsound predicts all of the mel-spectrogram tokens in one step and then refines the predicted tokens in the next step, so the best prediction is obtained after several steps. Our experiments show that, compared with the AR decoder, Diffsound both produces better text-to-sound generation results (MOS: 3.56 vs. 2.786) and achieves a generation speed five times faster.
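The contrast between the two decoding styles can be illustrated with a minimal sketch. It is not the Diffsound model: the two "decoders" are hypothetical toy functions that already know the target tokens, the `MASK` id and all function names are assumptions made for illustration, and the refinement rule (fix half of the wrong positions per call) is a deterministic stand-in for a learned discrete diffusion step. The sketch shows only why the number of model calls grows with sequence length for AR decoding, while parallel prediction with refinement uses a fixed number of steps.

```python
MASK = -1  # hypothetical id marking a token that has not been predicted yet


def make_toy_models(target):
    """Build two stand-in "decoders" that already know the target tokens.

    Both are illustrative placeholders for a trained network.
    """

    def predict_next(prefix):
        # AR view: emit only the token that follows the current prefix.
        return target[len(prefix)]

    def predict_all(tokens):
        # Non-AR view: re-predict every position in one call. This toy
        # version corrects half of the still-wrong positions (at least one).
        wrong = [i for i, tok in enumerate(tokens) if tok != target[i]]
        fixed = list(tokens)
        for i in wrong[: max(1, len(wrong) // 2)]:
            fixed[i] = target[i]
        return fixed

    return predict_next, predict_all


def autoregressive_decode(predict_next, length):
    """Generate tokens one by one; needs one model call per token."""
    tokens, calls = [], 0
    for _ in range(length):
        tokens.append(predict_next(tokens))
        calls += 1
    return tokens, calls


def iterative_refine_decode(predict_all, length, steps):
    """Start fully masked, then refine all positions for a fixed step count."""
    tokens, calls = [MASK] * length, 0
    for _ in range(steps):
        tokens = predict_all(tokens)
        calls += 1
    return tokens, calls


# Toy "mel-spectrogram token" sequence of length 16 (arbitrary values).
target = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
predict_next, predict_all = make_toy_models(target)

ar_tokens, ar_calls = autoregressive_decode(predict_next, len(target))
nar_tokens, nar_calls = iterative_refine_decode(predict_all, len(target), steps=5)

print("AR model calls:", ar_calls)
print("Refinement model calls:", nar_calls)
```

In this toy setting, doubling the sequence length doubles the AR call count, whereas the refinement loop needs only one additional step to reach the same result.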
