Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, because audio is inherently represented by long token sequences, the efficiency of audio generation remains a pressing issue, especially for AR models incorporated into large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel \textbf{S}cale-level \textbf{A}udio \textbf{T}okenizer (SAT) with improved residual quantization. Building on SAT, we further propose a scale-level \textbf{A}coustic \textbf{A}uto\textbf{R}egressive (AAR) modeling framework, which shifts next-token AR prediction to next-scale AR prediction, significantly reducing both training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze the design choices and demonstrate that the AAR framework achieves a remarkable \textbf{35}$\times$ faster inference speed and an improvement of \textbf{1.33} in Fr\'echet Audio Distance (FAD) over baselines on the AudioSet benchmark. Code: \url{https://github.com/qiuk2/AAR}.