Recent advances in language models (LMs) have demonstrated impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually convert source semantics to acoustic features offline, which requires the complete source speech and limits their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC that enables real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor and alternately processes semantic and acoustic features at each autoregressive time step, which eliminates the dependence on complete source speech. To mitigate the performance degradation that incomplete context can cause in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, in which a teacher model summarizes the present and future semantic context during training to guide the model's forecasting of the missing context; and 2) a semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input and thereby enhances context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experimental results show that StreamVoice converts speech in a streaming fashion while maintaining zero-shot performance comparable to non-streaming VC systems.
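The alternating processing and semantic masking described in the abstract can be illustrated with a minimal sketch. It is not the StreamVoice implementation: the token representation, `MASK_ID`, the per-token random masking scheme, and the names `interleave`, `mask_semantic`, `stream_convert`, and `predict_acoustic` are all hypothetical, and the speaker prompt and the teacher model are omitted. The sketch only shows why interleaving lets a causal model emit one acoustic frame per incoming semantic frame without waiting for the full utterance.

```python
import random
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

# Hypothetical token representation: ("sem" | "ac", token id).
Token = Tuple[str, int]

# Hypothetical id standing in for a corrupted (masked) semantic token.
MASK_ID = -1


def interleave(semantic: Sequence[int], acoustic: Sequence[int]) -> List[Token]:
    """Build a training sequence [s1, a1, s2, a2, ...] from frame-aligned inputs."""
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic sequences must be frame-aligned")
    sequence: List[Token] = []
    for s, a in zip(semantic, acoustic):
        sequence.append(("sem", s))
        sequence.append(("ac", a))
    return sequence


def mask_semantic(semantic: Sequence[int], ratio: float, rng: random.Random) -> List[int]:
    """Corrupt semantic tokens for training.

    Assumption: each token is masked independently with probability `ratio`;
    the masking scheme used by the authors may differ.
    """
    return [MASK_ID if rng.random() < ratio else s for s in semantic]


def stream_convert(
    semantic_stream: Iterable[int],
    predict_acoustic: Callable[[List[Token]], int],
) -> Iterator[int]:
    """Emit one acoustic token per incoming semantic token, with no look-ahead.

    `predict_acoustic` is a placeholder for a causal model: it sees only the
    interleaved history up to and including the current semantic token.
    """
    history: List[Token] = []
    for s in semantic_stream:
        history.append(("sem", s))
        a = predict_acoustic(history)
        history.append(("ac", a))
        yield a
```

Because `stream_convert` is a generator, each acoustic token becomes available as soon as its semantic frame arrives, and the predicted token is appended to the history so that the next step conditions on both streams.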