StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

Recent advances in language models (LMs) have shown impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually convert source semantics to acoustic features offline, which requires the complete source speech and limits their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC that enables real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor and alternately processes semantic and acoustic features at each autoregressive time step, which eliminates the dependence on complete source speech. To address the potential performance degradation caused by incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, which uses a teacher model to summarize the present and future semantic context during training and thereby guides the model's forecasting of missing context; 2) a semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input and thereby enhances context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experiments demonstrate StreamVoice's streaming conversion capability while achieving zero-shot performance comparable to non-streaming VC systems.
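The alternating, fully causal processing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the token representation, the `MASK` id, the function names, and the `toy_predict` stand-in for the LM are all hypothetical, and teacher-guided context foresight and the acoustic predictor are not modeled. It only shows (a) interleaving semantic and acoustic tokens in one sequence, (b) corrupting semantic input by random masking during training, and (c) a streaming loop that emits one acoustic token per incoming semantic token without reading any future frame.

```python
import random
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Token = Tuple[str, int]  # ("sem", id) or ("ac", id); an assumed representation
MASK = -1                # hypothetical id standing in for a masked semantic token


def interleave(semantic: Sequence[int], acoustic: Sequence[int]) -> List[Token]:
    """Build the alternating sequence [s1, a1, s2, a2, ...]."""
    if len(semantic) != len(acoustic):
        raise ValueError("semantic and acoustic sequences must have equal length")
    out: List[Token] = []
    for s, a in zip(semantic, acoustic):
        out.append(("sem", s))
        out.append(("ac", a))
    return out


def mask_semantic(semantic: Sequence[int], ratio: float,
                  rng: random.Random) -> List[int]:
    """Replace each semantic token with MASK with probability `ratio`."""
    return [MASK if rng.random() < ratio else s for s in semantic]


def stream_convert(source: Iterable[int], prompt: Sequence[Token],
                   predict: Callable[[List[Token]], int]) -> Iterator[int]:
    """Emit one acoustic token per incoming semantic token.

    The context only ever contains the prompt plus past and current frames,
    so no future look-ahead is needed.
    """
    context: List[Token] = list(prompt)
    for s in source:
        context.append(("sem", s))
        a = predict(context)          # the LM would run here
        context.append(("ac", a))
        yield a


def toy_predict(context: List[Token]) -> int:
    """Stand-in for the LM: maps the latest semantic id to a fake acoustic id."""
    kind, value = context[-1]
    if kind != "sem":
        raise ValueError("expected a semantic token at the end of the context")
    return value * 10


if __name__ == "__main__":
    print(interleave([1, 2], [10, 20]))
    print(list(stream_convert([1, 2, 3], [], toy_predict)))
```

Because `stream_convert` is a generator, each output is available as soon as its source frame arrives, which is the property that distinguishes streaming from offline conversion in this sketch.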

