Recent advances in language models (LMs) have demonstrated impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually perform offline conversion from source semantics to acoustic features, which requires the complete source speech and limits their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC that enables real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor and alternately processes semantic and acoustic features at each autoregressive time step, which eliminates the dependence on complete source speech. To address the potential performance degradation caused by incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, which uses a teacher model to summarize the present and future semantic context during training and thereby guides the model's forecasting of missing context; 2) a semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input and thereby enhances context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experimental results demonstrate that StreamVoice achieves streaming conversion while maintaining zero-shot performance comparable to that of non-streaming VC systems.
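The alternating processing and semantic masking described in the abstract can be illustrated with a minimal sketch. This is not the StreamVoice implementation: the function names, the `MASK` token id, the integer token representation, and the choice to mask only semantic tokens at a fixed ratio are all assumptions made for illustration, and `predict_acoustic` stands in for the causal LM together with its acoustic predictor.

```python
import random
from typing import Callable, Iterable, Iterator, List, Tuple

Token = Tuple[str, int]  # (stream name, token id); a hypothetical representation
MASK = -1                # hypothetical id marking a corrupted semantic token


def interleave(semantic: List[int], acoustic: List[int]) -> List[Token]:
    """Build the alternating sequence s_1, a_1, s_2, a_2, ... for causal training."""
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic streams must be frame-aligned")
    seq: List[Token] = []
    for s, a in zip(semantic, acoustic):
        seq.append(("sem", s))
        seq.append(("ac", a))
    return seq


def mask_semantic(semantic: List[int], ratio: float, rng: random.Random) -> List[int]:
    """Replace a random subset of semantic tokens with MASK (assumed masking scheme)."""
    n_mask = int(round(ratio * len(semantic)))
    masked = set(rng.sample(range(len(semantic)), n_mask))
    return [MASK if i in masked else s for i, s in enumerate(semantic)]


def streaming_convert(
    semantic_stream: Iterable[int],
    predict_acoustic: Callable[[List[Token]], int],
) -> Iterator[int]:
    """Emit one acoustic token per incoming semantic token, using only past context."""
    history: List[Token] = []
    for s in semantic_stream:
        history.append(("sem", s))     # the newest semantic frame arrives
        a = predict_acoustic(history)  # the predictor sees no future frames
        history.append(("ac", a))      # the output is fed back autoregressively
        yield a
```

Because `streaming_convert` is a generator that appends to `history` only as frames arrive, each acoustic token is produced before the next semantic frame is read, which is the property that removes the dependence on complete source speech in this sketch.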