Multi-level Temporal-channel Speaker Retrieval for Robust Zero-shot Voice Conversion

Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance from that speaker, without any additional model updates. Typical methods achieve zero-shot VC either by using a speaker representation from a pre-trained speaker verification (SV) model or by learning a speaker representation during VC training. However, existing speaker modeling methods overlook how the richness of speaker information varies along the temporal and frequency-channel dimensions of speech. This insufficient speaker modeling hampers the ability of the VC model to accurately represent speakers unseen during training. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to adapt flexibly to speaker characteristics that vary dynamically along the temporal and channel axes of speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), to locate when and where speaker information appears in speech. It retrieves a variable-length speaker representation from both the temporal and channel dimensions under the guidance of a pre-trained SV model. In addition, inspired by the hierarchical process of human speech production, the MTCR speaker module stacks several TCR blocks to extract speaker representations at multiple levels of granularity. Furthermore, to achieve better speech disentanglement and reconstruction, we introduce a cycle-based training strategy that simulates zero-shot inference recurrently. We adopt perceptual constraints on three aspects (content, style, and speaker) to drive this process. Experiments demonstrate that MTCR-VC is superior to previous zero-shot VC methods in modeling speaker timbre while maintaining good speech naturalness.
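As a rough illustration of weighting speech features by "when" (time frames) and "where" (channels), the toy sketch below pools a T x C feature map with one softmax over frames and one over channels. It is not the TCR block from the paper: the abstract does not specify the architecture, and the energy-based scoring, the function name `temporal_channel_pool`, and the fixed-length output are all assumptions made for illustration. A real model would presumably learn these scores under the guidance of the SV model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_channel_pool(features):
    """Pool a T x C feature map into a C-dimensional vector.

    Hypothetical stand-in scoring: mean absolute activation per frame
    ("when") and per channel ("where"). A trained model would learn
    these scores; the paper's TCR block is not reproduced here.
    """
    num_frames = len(features)
    num_channels = len(features[0])

    # One score per frame and one score per channel.
    frame_scores = [sum(abs(x) for x in frame) / num_channels for frame in features]
    channel_scores = [
        sum(abs(features[t][c]) for t in range(num_frames)) / num_frames
        for c in range(num_channels)
    ]

    time_weights = softmax(frame_scores)       # sums to 1 over frames
    channel_weights = softmax(channel_scores)  # sums to 1 over channels

    # Weighted sum over time, then rescale each channel by its weight.
    pooled = [
        channel_weights[c]
        * sum(time_weights[t] * features[t][c] for t in range(num_frames))
        for c in range(num_channels)
    ]
    return time_weights, channel_weights, pooled


# Toy input: 3 frames x 2 channels; the second frame and the first
# channel carry most of the energy.
toy = [[0.1, 0.0], [2.0, 0.2], [0.1, 0.1]]
time_w, chan_w, embedding = temporal_channel_pool(toy)
```

In this toy input the second frame and the first channel receive the largest weights, which mirrors the idea of emphasizing the regions where speaker information is concentrated.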
