We introduce InspireMusic, a framework that integrates super-resolution and a large language model for high-fidelity, long-form music generation. This unified framework generates high-fidelity music, songs, and audio by coupling an autoregressive transformer with a super-resolution flow-matching model, enabling controllable generation of high-fidelity, long-form music at higher sampling rates from both text and audio prompts. Unlike previous approaches, our model uses an audio tokenizer with a single codebook that carries richer semantic information, which reduces training cost and improves efficiency. This design enables high-quality audio generation with long-form coherence of up to $8$ minutes. Concretely, an autoregressive transformer based on Qwen 2.5 first predicts audio tokens; a super-resolution flow-matching model then generates high-sampling-rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model performs comparably to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on both subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
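To make the flow-matching stage concrete, the sketch below illustrates the sampling principle only: with linear conditional probability paths, the target velocity field toward a data point $x_1$ is $v(x, t) = (x_1 - x)/(1 - t)$, and generation integrates the ODE $dx/dt = v(x, t)$ from $t = 0$ to $t = 1$. This is a minimal, hypothetical illustration of flow-matching sampling in general, not InspireMusic's trained super-resolution model; the function name and scalar setting are our own assumptions for exposition.

```python
def euler_flow_sample(x0: float, x1: float, steps: int = 100) -> float:
    """Toy flow-matching sampler (illustrative only, not the paper's model).

    Integrates dx/dt = v(x, t) with Euler steps from t=0 to t=1, where
    v(x, t) = (x1 - x) / (1 - t) is the exact velocity field for the
    linear conditional path x_t = (1 - t) * x0 + t * x1.
    """
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Exact conditional velocity toward the target x1.
        v = (x1 - x) / (1.0 - t)
        x += v * dt  # Euler update
    return x


# Starting from "noise" x0, integration recovers the target x1.
print(euler_flow_sample(x0=-1.0, x1=3.0, steps=50))  # → 3.0
```

In the actual system, the velocity field is parameterized by a neural network conditioned on the low-resolution token sequence, and the state is a high-sampling-rate latent rather than a scalar; the ODE-integration loop is the shared mechanism.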