We propose UniVocal, a unified framework that pioneers Speech-Singing Code-Switching (SCS) synthesis, a task in which transitions between speaking and singing are driven autonomously by textual semantics, much as humans blend languages seamlessly. Unlike single-mode generation or systems that rely on switching-control tags, UniVocal infers vocal modes implicitly and solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. To address data scarcity, we introduce a scalable pipeline for synthesizing diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address the limitations of semantic tokenizers in capturing acoustic details, we also introduce a refined cent token and Chain-of-Thought (CoT) generation that plans prosody before content generation, improving both empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while remaining competitive on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.
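The abstract does not define the refined cent token, so the following is only an illustrative sketch of how pitch could be discretized on a cent scale, not the UniVocal implementation. It relies on the standard definition of the cent (1200 cents per octave, 100 per semitone). The reference pitch of 440 Hz, the 50-cent bin width, the ±3600-cent range, the `UNVOICED` marker, and all function names are assumptions made for this example.

```python
import math

# All constants below are illustrative assumptions, not values from the paper.
REF_HZ = 440.0        # reference pitch (A4)
BIN_WIDTH = 50.0      # width of one token bin, in cents
MIN_CENTS = -3600.0   # lowest representable pitch: 3 octaves below REF_HZ
N_BINS = 145          # covers -3600..+3600 cents in 50-cent steps
UNVOICED = -1         # hypothetical marker for frames without pitch


def hz_to_cents(f0_hz, ref_hz=REF_HZ):
    """Return the pitch offset from ref_hz in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f0_hz / ref_hz)


def cents_to_token(cents):
    """Quantize a cent value to a bin index, clamped to [0, N_BINS - 1]."""
    idx = int(round((cents - MIN_CENTS) / BIN_WIDTH))
    return max(0, min(N_BINS - 1, idx))


def f0_to_token(f0_hz):
    """Map an F0 value in Hz to a discrete token; non-positive F0 is unvoiced."""
    if f0_hz <= 0.0:
        return UNVOICED
    return cents_to_token(hz_to_cents(f0_hz))


if __name__ == "__main__":
    for f0 in (0.0, 220.0, 440.0, 880.0):
        print(f0, f0_to_token(f0))
```

A logarithmic cent scale keeps equal musical intervals at equal token distances, which is one plausible reason to prefer it over linear Hz bins when a model has to represent singing melody.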