Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained on this seed set to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates the need for phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
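To make the implicit alignment idea concrete, below is a minimal sketch (not the authors' released code) of how phoneme-to-acoustic cross-attention can be restricted to character-level spans: each phoneme is only allowed to attend to (or be attended by) the acoustic frames inside its parent character's span, so no phoneme-level duration labels are required. All names here (`build_span_mask`, `phoneme_to_char`, `char_frame_spans`) are illustrative assumptions, not identifiers from the paper.

```python
import torch


def build_span_mask(phoneme_to_char: torch.Tensor,
                    char_frame_spans: torch.Tensor,
                    num_frames: int) -> torch.Tensor:
    """Boolean mask of shape (P, T); True means attention is allowed.

    phoneme_to_char:  (P,)   index of the parent character of each phoneme
    char_frame_spans: (C, 2) [start, end) frame range of each character,
                             obtained from coarse character-level timing only
                             (e.g. note boundaries), not phoneme durations
    """
    starts = char_frame_spans[phoneme_to_char, 0].unsqueeze(1)   # (P, 1)
    ends = char_frame_spans[phoneme_to_char, 1].unsqueeze(1)     # (P, 1)
    frames = torch.arange(num_frames).unsqueeze(0)               # (1, T)
    return (frames >= starts) & (frames < ends)                  # (P, T)


# Toy usage: 5 phonemes spread over 2 characters, 100 acoustic frames.
phoneme_to_char = torch.tensor([0, 0, 1, 1, 1])
char_frame_spans = torch.tensor([[0, 40], [40, 100]])            # covers all frames
mask = build_span_mask(phoneme_to_char, char_frame_spans, num_frames=100)

# Apply to frame-to-phoneme cross-attention logits before the softmax:
logits = torch.randn(100, 5)                                     # (frames, phonemes)
logits = logits.masked_fill(~mask.T, float("-inf"))              # block out-of-span pairs
attn = logits.softmax(dim=-1)
```

The character spans only need to be coarse, so the mask tolerates noisy or uncertain alignments; the attention weights inside each span then resolve the fine-grained phoneme timing implicitly during training.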