Zero-shot voice conversion (VC) converts source speech into the voice of an arbitrary target speaker using only a single utterance of that speaker, without requiring additional model updates. Typical methods achieve zero-shot VC by using a speaker representation from a pre-trained speaker verification (SV) model or by learning the speaker representation during VC training. However, existing speaker modeling methods overlook how the richness of speaker information varies across the temporal and frequency-channel dimensions of speech. This insufficient speaker modeling hampers the ability of the VC model to accurately represent unseen speakers that are not in the training dataset. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to adapt flexibly to dynamically varying speaker characteristics along the temporal and channel axes of speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), which locates when and where speaker information appears in speech. It retrieves variable-length speaker representations from both the temporal and channel dimensions under the guidance of a pre-trained SV model. In addition, inspired by the hierarchical process of human speech production, the MTCR speaker module stacks several TCR blocks to extract speaker representations at multiple levels of granularity. Furthermore, to achieve better speech disentanglement and reconstruction, we introduce a cycle-based training strategy that recurrently simulates zero-shot inference. We adopt perceptual constraints on three aspects, including content, style, and speaker, to drive this process. Experiments demonstrate that MTCR-VC outperforms previous zero-shot VC methods in modeling speaker timbre while maintaining good speech naturalness.
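To make the retrieval idea concrete, the following is a minimal PyTorch sketch of one possible temporal-channel retrieval block, assuming frame-level encoder features of shape (B, T, C) and an utterance-level embedding from a pre-trained SV model. All class, dimension, and variable names here (TCRBlock, feat_dim, sv_dim, out_dim) are hypothetical illustrations under these assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a temporal-channel retrieval (TCR) block.
# Names and layer choices are illustrative, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TCRBlock(nn.Module):
    """Retrieves a speaker representation from speech features by weighting the
    temporal and channel dimensions, guided by a pre-trained SV embedding."""

    def __init__(self, feat_dim: int, sv_dim: int, out_dim: int):
        super().__init__()
        self.temporal_query = nn.Linear(sv_dim, feat_dim)  # scores frame relevance
        self.channel_gate = nn.Linear(sv_dim, feat_dim)    # scores channel relevance
        self.proj = nn.Linear(feat_dim, out_dim)

    def forward(self, feats: torch.Tensor, sv_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C) frame-level speech features
        # sv_emb: (B, sv_dim) utterance-level embedding from a pre-trained SV model
        query = self.temporal_query(sv_emb)                      # (B, C)
        scores = torch.einsum("btc,bc->bt", feats, query)        # (B, T)
        t_weights = F.softmax(scores / feats.size(-1) ** 0.5, dim=-1)
        pooled = torch.einsum("bt,btc->bc", t_weights, feats)    # temporal retrieval
        c_weights = torch.sigmoid(self.channel_gate(sv_emb))     # channel retrieval
        return self.proj(pooled * c_weights)                     # (B, out_dim)


# Multi-level stacking: one speaker vector per granularity level.
blocks = nn.ModuleList(TCRBlock(feat_dim=256, sv_dim=192, out_dim=128) for _ in range(3))
feats = torch.randn(2, 120, 256)   # dummy encoder features (B, T, C)
sv_emb = torch.randn(2, 192)       # dummy SV embedding
speaker_reps = [blk(feats, sv_emb) for blk in blocks]  # list of (B, 128) vectors
```

In the multi-level setting described above, such a block would plausibly be applied to encoder features at several depths, so that each level contributes a speaker vector of its own granularity, echoing the hierarchical view of speech production.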