StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

Recent advances in language models (LMs) have demonstrated impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually perform offline conversion from source semantics to acoustic features, which requires the complete source speech and limits their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC that enables real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor and alternately processes semantic and acoustic features at each time step of autoregression, which eliminates the dependence on complete source speech. To address the potential performance degradation caused by incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, which uses a teacher model to summarize the present and future semantic context during training and thereby guides the model's forecasting of the missing context; 2) a semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input and thereby enhances context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experimental results demonstrate StreamVoice's streaming conversion capability while maintaining zero-shot performance comparable to non-streaming VC systems.
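
The alternating processing and semantic masking described above can be illustrated with a minimal sketch. It is not the StreamVoice implementation: the integer token ids, the `MASK` value, the function names, and the `predict_acoustic` callback are all hypothetical stand-ins for the real model components. The sketch only shows the data flow, in which the acoustic output at step t is predicted from semantic inputs up to step t and acoustic outputs up to step t-1, with no future look-ahead.

```python
import random
from typing import Callable, List, Sequence

MASK = -1  # hypothetical id standing in for a corrupted (masked) semantic token


def interleave(semantic: Sequence[int], acoustic: Sequence[int]) -> List[int]:
    """Build the alternating sequence [s_1, a_1, s_2, a_2, ...]."""
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic sequences must be the same length")
    out: List[int] = []
    for s, a in zip(semantic, acoustic):
        out.extend((s, a))
    return out


def mask_semantic(semantic: Sequence[int], ratio: float, rng: random.Random) -> List[int]:
    """Training-time corruption: replace each semantic token with MASK with probability `ratio`."""
    return [MASK if rng.random() < ratio else s for s in semantic]


def stream_convert(
    source_semantic: Sequence[int],
    predict_acoustic: Callable[[List[int]], int],
) -> List[int]:
    """Streaming loop: consume one semantic token, emit one acoustic token.

    `predict_acoustic` is a placeholder for the causal LM plus acoustic
    predictor; it only ever receives the interleaved history so far.
    """
    history: List[int] = []
    acoustic: List[int] = []
    for s in source_semantic:
        history.append(s)              # newly arrived semantic frame s_t
        a = predict_acoustic(history)  # sees s_1..s_t and a_1..a_{t-1} only
        history.append(a)              # feed the prediction back autoregressively
        acoustic.append(a)
    return acoustic


if __name__ == "__main__":
    # Toy predictor: derives the acoustic token from the latest semantic token.
    print(stream_convert([1, 2, 3], lambda h: h[-1] + 100))
    print(interleave([1, 2], [10, 20]))
    print(mask_semantic([1, 2, 3, 4], ratio=0.5, rng=random.Random(0)))
```

Because each call to `predict_acoustic` receives only the history accumulated so far, the loop can run frame by frame as source speech arrives, which is the property the fully causal design is meant to provide.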
