UniVoice: A Unified Model for Speech and Singing Voice Generation

Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
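The null melody token and the flow matching objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `NULL_MELODY_ID`, `build_melody_condition`, and `cfm_training_pair` are hypothetical, the per-frame MIDI pitch layout and the embedding table are assumptions, and the linear interpolation path is one common choice for conditional flow matching that the abstract does not confirm.

```python
import numpy as np

# Hypothetical layout: row 0 of the embedding table is the learned null
# melody token; rows 1..128 hold embeddings for MIDI pitches 0..127.
NULL_MELODY_ID = 0
NUM_MIDI_PITCHES = 128


def build_melody_condition(midi_notes, num_frames, embedding_table):
    """Return per-frame melody embeddings of shape (num_frames, dim).

    midi_notes is None for speech, or a sequence of per-frame MIDI pitch
    numbers (0..127) for singing. Both conventions are assumptions.
    """
    if midi_notes is None:
        # Speech: every frame carries the same null melody token.
        ids = np.full(num_frames, NULL_MELODY_ID, dtype=np.int64)
    else:
        # Singing: shift pitches by one to skip the null slot.
        ids = np.asarray(midi_notes, dtype=np.int64) + 1
        if ids.shape != (num_frames,):
            raise ValueError("expected one MIDI pitch per frame")
    return embedding_table[ids]


def cfm_training_pair(x1, rng):
    """Sample (x_t, t, target_velocity) on a linear noise-to-data path.

    Assumes x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I), so the
    regression target for the velocity field is x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform()
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, t, x1 - x0


rng = np.random.default_rng(0)
table = rng.standard_normal((NUM_MIDI_PITCHES + 1, 8))  # stand-in for learned weights

speech_melody = build_melody_condition(None, 5, table)
singing_melody = build_melody_condition([60, 60, 62, 64, 64], 5, table)

mel_frames = rng.standard_normal((5, 4))  # stand-in for target acoustic features
x_t, t, target = cfm_training_pair(mel_frames, rng)
```

In this sketch the same backbone would receive `speech_melody` or `singing_melody` alongside the content and timbre conditions, so the only difference between the two tasks is whether the melody slot holds note embeddings or the null token.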
