SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

Recent song generation systems can synthesize realistic audio, yet generating complete songs remains challenging for two reasons. First, existing methods offer little explicit song-level arrangement planning, so a model must organize the overall arrangement while also generating low-level audio details. This often yields incoherent arrangements, with weak section transitions and limited dynamic progression. Second, coarse modeling of the different musical parts obscures their distinct roles and interactions, which limits the arrangement richness of generated songs. In this paper, we present SketchSong, a hierarchical song generation framework that addresses both issues through song-level sketch planning and fine-grained multi-track modeling. Along the temporal dimension, SketchSong first predicts a compact sequence of high-level sketch tokens derived from compressed audio representations, and then generates audio tokens conditioned on these sketches. This coarse-to-fine process gives the model an explicit arrangement plan before detailed audio generation. Along the track dimension, SketchSong explicitly models four tracks: vocals, bass, drums, and other instruments. This lets the model capture the roles and interactions of different musical parts more precisely. Experiments on song generation benchmarks show that SketchSong consistently outperforms our baseline on both objective metrics and human listening tests. Although it uses no additional post-training for preference optimization, such as lyric and text-prompt alignment, SketchSong achieves results competitive with strong, post-trained open-source systems, demonstrating the effectiveness of the overall design.
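To make the two-stage data flow concrete, the following is a minimal sketch of the coarse-to-fine, four-track generation order described above. It is not the SketchSong implementation: the function names, vocabulary sizes, the fixed number of audio frames per sketch token, and the random sampling that stands in for the learned models are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

# The four tracks named in the abstract.
TRACKS = ("vocals", "bass", "drums", "other")


def plan_sketch(num_sketch_tokens: int, vocab_size: int,
                rng: random.Random) -> List[int]:
    """Stage 1 (placeholder): produce a compact song-level sketch sequence.

    A real system would predict these tokens with a learned model; here
    they are sampled at random purely to show the data flow.
    """
    return [rng.randrange(vocab_size) for _ in range(num_sketch_tokens)]


def generate_tracks(sketch: List[int], frames_per_sketch: int,
                    vocab_size: int, rng: random.Random) -> Dict[str, List[int]]:
    """Stage 2 (placeholder): expand each sketch token into audio tokens.

    Every sketch token covers `frames_per_sketch` frames (an assumed ratio),
    and each frame holds one token per track.
    """
    audio: Dict[str, List[int]] = {track: [] for track in TRACKS}
    for sketch_token in sketch:
        for _ in range(frames_per_sketch):
            for track in TRACKS:
                # Stand-in for conditioning: the audio token depends on the
                # current sketch token. A real model would also condition on
                # earlier frames and on the other tracks.
                token = (sketch_token + rng.randrange(vocab_size)) % vocab_size
                audio[track].append(token)
    return audio


def generate_song(num_sketch_tokens: int = 8, frames_per_sketch: int = 4,
                  sketch_vocab: int = 16, audio_vocab: int = 1024,
                  seed: int = 0) -> Tuple[List[int], Dict[str, List[int]]]:
    """Run the plan-then-generate pipeline with a fixed seed."""
    rng = random.Random(seed)
    sketch = plan_sketch(num_sketch_tokens, sketch_vocab, rng)
    audio = generate_tracks(sketch, frames_per_sketch, audio_vocab, rng)
    return sketch, audio


if __name__ == "__main__":
    sketch, audio = generate_song()
    print("sketch length:", len(sketch))
    for track in TRACKS:
        print(track, "audio tokens:", len(audio[track]))
```

The point of the sketch is the ordering: the whole sketch sequence exists before any audio token is produced, and the four tracks are kept as separate token streams rather than a single mixed stream.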
