Recent progress in voice conversion~(VC) has achieved a new milestone in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for the linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification is a Mixture-of-Experts (MoE) module that explicitly models shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism, comprising shared-expert isolation and scenario-aware domain-expert assignment driven by global-local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer via a gated mechanism, allowing adaptive use of prosody information. Furthermore, to support this design and alleviate the data-imbalance issue (abundant speech vs. scarce singing data), we adopt two-stage progressive training consisting of foundational pre-training followed by scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while demonstrating flexible scenario control and offering a fast decoding variant that requires as few as 2 steps. Code and models will be released soon.
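To make the dual-path routing concrete, below is a minimal PyTorch-style sketch of the idea described above: a shared expert that is isolated from routing and always active, plus scenario domain experts assigned by a router that combines a global scenario cue with local per-token features. All module and parameter names (DualPathMoE, n_domain_experts, scenario_emb, etc.) are illustrative assumptions, not the OneVoice implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathMoE(nn.Module):
    def __init__(self, d_model: int, n_domain_experts: int = 3):
        super().__init__()
        # Shared expert is isolated from routing and always active:
        # it captures conversion knowledge common to all scenarios.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        # One domain expert per scenario (linguistic-preserving,
        # expressive, singing).
        self.domain_experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_domain_experts)])
        # Router mixes a global cue (e.g., a scenario embedding) with
        # local per-token features before assigning domain experts.
        self.router = nn.Linear(2 * d_model, n_domain_experts)

    def forward(self, x: torch.Tensor, scenario_emb: torch.Tensor):
        # x: (B, T, D) hidden states; scenario_emb: (B, D) global cue.
        global_cue = scenario_emb.unsqueeze(1).expand_as(x)
        logits = self.router(torch.cat([x, global_cue], dim=-1))
        weights = F.softmax(logits, dim=-1)                  # (B, T, E)
        domain_out = torch.stack(
            [e(x) for e in self.domain_experts], dim=-1)     # (B, T, D, E)
        domain_mix = (domain_out * weights.unsqueeze(-2)).sum(-1)
        # Shared path is added unconditionally; routed path is scenario-aware.
        return x + self.shared_expert(x) + domain_mix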
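The per-layer gated prosody conditioning can likewise be sketched as a sigmoid gate over projected prosodic features, so each layer learns how much prosody to inject; again, names and shapes (GatedProsodyFusion, d_prosody) are assumptions for illustration.

import torch
import torch.nn as nn

class GatedProsodyFusion(nn.Module):
    def __init__(self, d_model: int, d_prosody: int):
        super().__init__()
        self.proj = nn.Linear(d_prosody, d_model)
        # Gate decides, per token and per channel, how much prosody
        # conditioning to inject at this layer.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h: torch.Tensor, prosody: torch.Tensor):
        # h: (B, T, D) layer hidden states; prosody: (B, T, Dp)
        # scenario-specific prosodic features (e.g., pitch/energy contours).
        p = self.proj(prosody)
        g = torch.sigmoid(self.gate(torch.cat([h, p], dim=-1)))
        return h + g * p  # adaptive, gated use of prosody information

Finally, a minimal sketch of a LoRA adapter of the kind that could realize the stage-two domain experts: pre-trained weights stay frozen and only a low-rank update is trained on the scarce expressive/singing data, mitigating the speech-vs-singing imbalance. Rank and scaling values here are illustrative, not the paper's settings.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # frozen pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor):
        # Low-rank update B @ A is the only trainable part, so the
        # scenario-enhancement stage touches few parameters.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)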