Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
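
The melody conditioning described above can be illustrated with a minimal sketch. It is not the UniVoice implementation: the embedding width, the frame-level MIDI representation, the random initialisation standing in for learned parameters, the linear probability path, and all names (`melody_condition`, `flow_matching_pair`, `null_melody_token`, `midi_embedding`) are assumptions made for illustration only.

```python
import random

EMBED_DIM = 4           # hypothetical embedding width
NUM_MIDI_PITCHES = 128  # MIDI pitch numbers range over 0..127

_rng = random.Random(0)


def _random_vector(dim):
    """Stand-in for a learned parameter vector (randomly initialised here)."""
    return [_rng.gauss(0.0, 1.0) for _ in range(dim)]


# Hypothetical learned parameters: one embedding per MIDI pitch,
# plus a single null melody token used for speech.
midi_embedding = [_random_vector(EMBED_DIM) for _ in range(NUM_MIDI_PITCHES)]
null_melody_token = _random_vector(EMBED_DIM)


def melody_condition(num_frames, midi_notes=None):
    """Return one melody vector per frame.

    midi_notes: per-frame MIDI pitch numbers for singing (an assumed
    representation), or None for speech.
    """
    if midi_notes is None:
        # Speech: every frame receives the same null token, so the
        # model gets no melody constraint and must infer prosody itself.
        return [list(null_melody_token) for _ in range(num_frames)]
    if len(midi_notes) != num_frames:
        raise ValueError("expected one MIDI pitch per frame")
    # Singing: look up the embedding of each frame's pitch.
    return [list(midi_embedding[pitch]) for pitch in midi_notes]


def flow_matching_pair(x0, x1, t):
    """Training pair for flow matching on a linear path (an assumption).

    x0 is a noise sample, x1 a data sample, t a time in [0, 1].
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity
```

In this sketch a speech utterance calls `melody_condition(num_frames)` and a singing utterance calls `melody_condition(num_frames, midi_notes)`; both results would be passed to the same backbone alongside the content and timbre conditions, which is what allows one model to serve both tasks.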