Recent advances in language models (LMs) have yielded impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually perform offline conversion from source semantics to acoustic features, which requires the complete source speech and limits their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC that enables real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, and alternately processes semantic and acoustic features at each autoregressive time step, which eliminates the dependence on complete source speech. To address the potential performance degradation caused by incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, which uses a teacher model to summarize the present and future semantic context during training to guide the model's forecasting of missing context; 2) a semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input, enhancing context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experiments demonstrate that StreamVoice supports streaming conversion while achieving zero-shot performance comparable to that of non-streaming VC systems.
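To make the alternating, fully causal processing concrete, the following is a minimal Python sketch and not the StreamVoice implementation. It assumes that semantic and acoustic features are discrete token ids, that semantic masking corrupts only the semantic tokens, and that the acoustic predictor is an arbitrary callable; the names `interleave`, `mask_semantic`, `stream_convert`, and `predict_acoustic` are hypothetical and chosen for illustration only.

```python
import random
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

# A token is tagged "S" (semantic) or "A" (acoustic). Discrete ids are an
# assumption made for this sketch, not a detail taken from the abstract.
Token = Tuple[str, int]


def interleave(semantic: Sequence[int], acoustic: Sequence[int]) -> List[Token]:
    """Build the alternating sequence [s1, a1, s2, a2, ...]."""
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic sequences must be frame-aligned")
    sequence: List[Token] = []
    for s, a in zip(semantic, acoustic):
        sequence.append(("S", s))
        sequence.append(("A", a))
    return sequence


def mask_semantic(tokens: Sequence[Token], ratio: float, mask_id: int,
                  rng: random.Random) -> List[Token]:
    """Replace a random fraction of semantic tokens with `mask_id`.

    Acoustic tokens are left untouched (an assumption of this sketch).
    """
    return [
        ("S", mask_id) if kind == "S" and rng.random() < ratio else (kind, value)
        for kind, value in tokens
    ]


def stream_convert(semantic_stream: Iterable[int],
                   predict_acoustic: Callable[[Sequence[Token]], int],
                   prompt: Sequence[Token] = ()) -> Iterator[int]:
    """Emit one acoustic token per incoming semantic token.

    At step t the predictor sees only the prompt, s_1..s_t, and a_1..a_{t-1},
    so no future source frame is ever required.
    """
    history: List[Token] = list(prompt)
    for s in semantic_stream:
        history.append(("S", s))
        a = predict_acoustic(history)  # stand-in for the causal LM
        history.append(("A", a))
        yield a


if __name__ == "__main__":
    # Toy predictor: derives the acoustic id from the latest semantic id.
    toy = lambda history: history[-1][1] + 100
    print(list(stream_convert([1, 2, 3], toy)))
    print(mask_semantic(interleave([1, 2], [10, 20]), 1.0, -1, random.Random(0)))
```

Because `stream_convert` is a generator, each acoustic token becomes available as soon as its semantic frame arrives. The masking helper would apply only at training time, where the interleaved sequence is built from ground-truth features.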