Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, because audio is inherently represented by long token sequences, the efficiency of audio generation remains a pressing issue, especially for AR models incorporated into large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel \textbf{S}cale-level \textbf{A}udio \textbf{T}okenizer (SAT) with improved residual quantization. Building on SAT, we further propose a scale-level \textbf{A}coustic \textbf{A}uto\textbf{R}egressive (AAR) modeling framework, which shifts next-token AR prediction to next-scale AR prediction, significantly reducing both training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze the design choices and demonstrate that the AAR framework achieves a remarkable \textbf{35}$\times$ faster inference speed and an improvement of \textbf{1.33} in Fr\'echet Audio Distance (FAD) over baselines on the AudioSet benchmark. Code: \url{https://github.com/qiuk2/AAR}.