We introduce InspireMusic, a framework that integrates super-resolution and a large language model for high-fidelity, long-form music generation. This unified framework generates high-fidelity music, songs, and audio by coupling an autoregressive transformer with a super-resolution flow-matching model, enabling controllable generation of high-fidelity, long-form music at higher sampling rates from both text and audio prompts. Unlike previous approaches, our model uses an audio tokenizer with a single codebook that carries richer semantic information, which reduces training cost and improves efficiency. This design enables high-quality audio generation with long-form coherence of up to $8$ minutes. Concretely, an autoregressive transformer based on Qwen 2.5 first predicts audio tokens; a super-resolution flow-matching model then generates high-sampling-rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model performs comparably to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on both subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
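To make the flow-matching stage concrete, the sketch below illustrates the sampling principle only: with linear conditional probability paths, the target velocity field toward a data point $x_1$ is $v(x, t) = (x_1 - x)/(1 - t)$, and generation integrates the ODE $dx/dt = v(x, t)$ from $t = 0$ to $t = 1$. This is a minimal, hypothetical illustration of flow-matching sampling in general, not InspireMusic's trained super-resolution model; the function name and scalar setting are our own assumptions for exposition.

```python
def euler_flow_sample(x0: float, x1: float, steps: int = 100) -> float:
    """Toy flow-matching sampler (illustrative only, not the paper's model).

    Integrates dx/dt = v(x, t) with Euler steps from t=0 to t=1, where
    v(x, t) = (x1 - x) / (1 - t) is the exact velocity field for the
    linear conditional path x_t = (1 - t) * x0 + t * x1.
    """
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Exact conditional velocity toward the target x1.
        v = (x1 - x) / (1.0 - t)
        x += v * dt  # Euler update
    return x


# Starting from "noise" x0, integration recovers the target x1.
print(euler_flow_sample(x0=-1.0, x1=3.0, steps=50))  # → 3.0
```

In the actual system, the velocity field is parameterized by a neural network conditioned on the low-resolution token sequence, and the state is a high-sampling-rate latent rather than a scalar; the ODE-integration loop is the shared mechanism.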