Recent song generation systems have shown promising results in producing songs from lyrics, global text prompts, or both. However, most existing systems cannot model the temporally varying attributes of songs, which limits fine-grained control over musical structure and dynamics. In this paper, we propose SegTune, a non-autoregressive framework for structured and controllable song generation. SegTune enables segment-level control by allowing users or large language models to specify local musical descriptions aligned to song sections. These segment-level prompts are injected into the model by temporally broadcasting them to their corresponding time windows, while global prompts influence the whole song to ensure stylistic coherence. To obtain accurate segment durations and enable precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamped lyrics in LRC format. We further construct a large-scale data pipeline for collecting high-quality songs with aligned lyrics and prompts, and propose new evaluation metrics to assess segment-level alignment and vocal attribute consistency. Experimental results show that SegTune achieves better controllability and musical coherence than existing baselines. Demos are available at https://cai525.github.io/SegTune_demo.
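To make the temporal broadcasting idea concrete, the following is a minimal sketch, not the SegTune implementation. It parses sentence-level LRC timestamps and maps each segment's time window to frame indices, so that a per-segment prompt could be looked up for every frame. The function names (`parse_lrc`, `broadcast_segments`), the frame rate, the example lyrics, and the choice to derive windows directly from consecutive LRC timestamps are all assumptions made for illustration.

```python
import re

# Common LRC timestamp form: [mm:ss.xx] followed by the lyric text.
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")


def parse_lrc(text):
    """Return a list of (start_seconds, lyric) pairs parsed from LRC text."""
    entries = []
    for line in text.splitlines():
        m = LRC_LINE.match(line.strip())
        if m:
            minutes, seconds, lyric = m.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries


def broadcast_segments(segments, total_seconds, frame_rate):
    """Map every frame to the index of the segment whose window covers it.

    segments: list of (start_seconds, end_seconds) windows.
    Frames not covered by any window are marked with -1.
    """
    n_frames = int(round(total_seconds * frame_rate))
    frame_to_segment = [-1] * n_frames
    for idx, (start, end) in enumerate(segments):
        first = max(0, int(round(start * frame_rate)))
        last = min(n_frames, int(round(end * frame_rate)))
        for f in range(first, last):
            frame_to_segment[f] = idx
    return frame_to_segment


# Hypothetical input: three timestamped lines in a 6-second clip.
lrc_text = "[00:00.00]first line\n[00:02.50]second line\n[00:04.00]third line"
total_seconds = 6.0
frame_rate = 2  # assumed frames per second, chosen small for readability

entries = parse_lrc(lrc_text)
starts = [start for start, _ in entries]
# Each window ends where the next one begins; the last ends with the clip.
windows = list(zip(starts, starts[1:] + [total_seconds]))
frame_to_segment = broadcast_segments(windows, total_seconds, frame_rate)
```

In a real model the per-frame index would typically select a segment prompt embedding that is combined with a global prompt embedding shared by all frames; this sketch stops at the index mapping.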