We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
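As a rough illustration of what confidence-based parallel decoding can look like, the sketch below fills a fully masked token sequence over a fixed number of rounds, committing the most confident predictions in each round. It is a minimal sketch under stated assumptions, not SoundStorm's implementation: `toy_model` is a hypothetical stand-in that returns random logits in place of a bidirectional-attention network, conditioning on semantic tokens is omitted, the sequence is a single token stream, candidates are picked greedily, and the cosine unmasking schedule and the `MASK` sentinel are illustrative choices that the abstract does not specify.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a position that is not decoded yet


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional network.

    Returns random logits of shape (len(tokens), vocab_size). A real model
    would attend over all positions and over the conditioning tokens.
    """
    return rng.normal(size=(len(tokens), vocab_size))


def parallel_decode(seq_len, vocab_size, num_steps, seed=0):
    """Fill a masked sequence in at most `num_steps` parallel rounds."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=np.int64)

    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break

        # Predict every position at once and turn logits into probabilities.
        logits = toy_model(tokens, vocab_size, rng)
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)

        candidates = probs.argmax(axis=-1)  # greedy choice (a simplification)
        confidence = probs.max(axis=-1)

        # Assumed cosine schedule: fraction of the sequence that should still
        # be masked after this round; it reaches zero in the last round.
        frac_masked = np.cos(np.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(np.floor(frac_masked * seq_len))
        num_to_fill = min(max(1, masked.size - keep_masked), masked.size)

        # Commit only the most confident of the still-masked positions.
        order = masked[np.argsort(-confidence[masked])]
        chosen = order[:num_to_fill]
        tokens[chosen] = candidates[chosen]

    return tokens


if __name__ == "__main__":
    print(parallel_decode(seq_len=32, vocab_size=16, num_steps=4))
```

The point of the sketch is the cost structure: the number of model calls is bounded by `num_steps` rather than by the sequence length, which is where a non-autoregressive decoder can gain speed over token-by-token generation.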