HeartMuLa: A Family of Open Sourced Music Foundation Models

Dongchao Yang, Yuxin Xie, Yuguo Yin, Zheyu Wang, Xiaoyu Yi, Gongxi Zhu, Xiaolong Weng, Zihan Xiong, Yingzhe Ma, Dading Cong, Jingliang Liu, Zihang Huang, Jinghan Ru, Rongjie Huang, Haoran Wan, Peixu Wang, Kuoxi Yu, Helin Wang, Liming Liang, Xianwei Zhuang, Yuanyuan Wang, Dingdong Wang, Haohan Guo, Junjie Cao, Zeqian Ju, Songxiang Liu, Yuewen Cao, Heming Weng, Yuexian Zou

We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; and (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, HeartMuLa provides two specialized modes: (i) fine-grained musical attribute control, which allows users to specify the style of different song sections (e.g., intro, verse, chorus) using natural language prompts; and (ii) short, engaging music generation, which is suitable as background music for short videos. Lastly, HeartMuLa improves significantly when scaled to 7B parameters. For the first time, we show that a Suno-level, commercial-grade system can be reproduced using academic-scale data and GPU resources. We expect these foundation models to serve as strong baselines for future research and to facilitate practical applications in multimodal content production.
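
To give a rough sense of why a 12.5 Hz frame rate helps autoregressive modeling, the sketch below counts how many codec frames cover a song of a given length. Only the 12.5 Hz figure comes from the abstract; the 50 Hz comparison rate, the 3-minute song length, and the helper name `frames_for_duration` are illustrative assumptions. The count is in frames rather than tokens, since the number of codebooks per frame is not stated here.

```python
import math


def frames_for_duration(duration_s: float, frame_rate_hz: float) -> int:
    """Return the number of codec frames needed to cover `duration_s` seconds."""
    return math.ceil(duration_s * frame_rate_hz)


HEARTCODEC_FRAME_RATE_HZ = 12.5   # stated in the abstract
COMPARISON_FRAME_RATE_HZ = 50.0   # hypothetical higher-rate codec, for comparison only

song_seconds = 180.0              # assumed 3-minute song

low_rate_frames = frames_for_duration(song_seconds, HEARTCODEC_FRAME_RATE_HZ)
high_rate_frames = frames_for_duration(song_seconds, COMPARISON_FRAME_RATE_HZ)

print(f"12.5 Hz: {low_rate_frames} frames")
print(f"50 Hz:   {high_rate_frames} frames")
print(f"sequence length ratio: {high_rate_frames / low_rate_frames:.1f}x")
```

Under these assumptions the sequence the language model must generate is four times shorter at 12.5 Hz than at the hypothetical 50 Hz rate, which is the sense in which a low frame rate makes long-form generation cheaper.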
