Recent progress in voice conversion (VC) has reached new milestones in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework that handles all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, which provides high fidelity and efficient sequence modeling. Unification centers on a Mixture-of-Experts (MoE) architecture that explicitly models shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism comprising shared-expert isolation and scenario-aware domain-expert assignment guided by global and local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer through a gated mechanism, allowing the model to use prosody information adaptively. Furthermore, to support this design and alleviate data imbalance (abundant speech vs. scarce singing data), we adopt a two-stage progressive training strategy consisting of foundational pre-training followed by scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while also demonstrating flexible control over scenarios and offering a fast decoding variant that needs as few as 2 steps. Audio samples are available on the demo page.
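To make the layer-level design more concrete, the following is a minimal NumPy sketch of one way the described components could fit together: gated prosody fusion, an always-active shared expert, and LoRA-style domain experts weighted by a router that sees both a local cue (the token state) and a global cue (a scenario embedding). It is not the authors' implementation. The class name `ToyMoELayer`, all dimensions, the soft (non-top-k) routing, the concatenation-based router input, and the residual combination are assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ToyMoELayer:
    """Hypothetical layer: shared expert + LoRA-style domain experts + gated prosody."""

    def __init__(self, d_model=16, d_prosody=4, n_domain=3, rank=2, n_scenarios=3):
        s = 0.1
        # Shared expert: applied to every token regardless of scenario.
        self.w_shared = rng.normal(0, s, (d_model, d_model))
        # Each domain expert is a low-rank (LoRA-style) update A_e @ B_e.
        self.lora_a = rng.normal(0, s, (n_domain, d_model, rank))
        self.lora_b = rng.normal(0, s, (n_domain, rank, d_model))
        # Router input (assumed): local token state concatenated with a
        # global scenario embedding.
        self.scenario_emb = rng.normal(0, s, (n_scenarios, d_model))
        self.w_router = rng.normal(0, s, (2 * d_model, n_domain))
        # Gated prosody fusion parameters.
        self.w_prosody = rng.normal(0, s, (d_prosody, d_model))
        self.w_gate = rng.normal(0, s, (d_model + d_prosody, d_model))

    def route(self, h, scenario_id):
        # Broadcast the global scenario cue to every time step.
        g = np.broadcast_to(self.scenario_emb[scenario_id], h.shape)
        logits = np.concatenate([h, g], axis=-1) @ self.w_router
        return softmax(logits)  # (T, n_domain), rows sum to 1

    def forward(self, h, prosody, scenario_id):
        # Gated fusion: a per-dimension gate in (0, 1) scales the prosody signal.
        gate = sigmoid(np.concatenate([h, prosody], axis=-1) @ self.w_gate)
        h = h + gate * (prosody @ self.w_prosody)
        # Path 1: shared expert, always active.
        shared_out = h @ self.w_shared
        # Path 2: domain experts, mixed by the scenario-aware router.
        weights = self.route(h, scenario_id)
        domain_out = np.einsum("td,edr,erk->etk", h, self.lora_a, self.lora_b)
        mixed = np.einsum("te,etk->tk", weights, domain_out)
        return h + shared_out + mixed, weights, gate


layer = ToyMoELayer()
h = rng.normal(size=(5, 16))        # 5 time steps of hidden states
prosody = rng.normal(size=(5, 4))   # 5 time steps of prosodic features
out, weights, gate = layer.forward(h, prosody, scenario_id=2)
print(out.shape, weights.shape)
```

In this sketch the shared path carries scenario-independent conversion knowledge, while the low-rank domain experts add only a small number of scenario-specific parameters, which is one plausible reason a LoRA-based design suits the second training stage when singing data is scarce.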