LeVo: High-Quality Song Generation with Multi-Preference Alignment

Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, which limits audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language-model-based framework consisting of LeLM and Music Codec. LeLM models two types of tokens in parallel: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which encode vocals and accompaniment separately for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between the token types. To further enhance musicality and instruction-following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods on both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further support the effectiveness of our designs. Audio examples and source code are available at https://levo-demo.github.io and https://github.com/tencent-ailab/songgeneration.
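
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO objective on which the alignment method is based. It is not the LeVo implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration, and the multi-preference handling and semi-automatic data construction described in the abstract are not shown.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under a frozen reference model. All names here are hypothetical.
    """
    # How much more the policy favors each sample than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; beta controls deviation from the reference.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy prefers the chosen sample more than the reference: the loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

In a setting with several preference dimensions (for example musicality and instruction following), one plausible arrangement is to build separate preference pairs per dimension and apply this same loss to each pair; whether LeVo does so is not stated in the abstract.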
