Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of that speaker, without requiring additional model updates. Typical methods achieve zero-shot VC either by using a speaker representation from a pre-trained speaker verification (SV) model or by learning a speaker representation during VC training. However, existing speaker modeling methods overlook how the richness of speaker information varies across the temporal and frequency-channel dimensions of speech. This insufficient speaker modeling hampers the VC model's ability to accurately represent unseen speakers, i.e., those absent from the training dataset. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to adapt flexibly to speaker characteristics that vary dynamically along the temporal and channel axes of speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), which identifies when and where speaker information appears in speech. TCR retrieves a variable-length speaker representation from both the temporal and channel dimensions under the guidance of a pre-trained SV model. In addition, inspired by the hierarchical process of human speech production, the MTCR speaker module stacks several TCR blocks to extract speaker representations at multiple levels of granularity. Furthermore, to achieve better speech disentanglement and reconstruction, we introduce a cycle-based training strategy that recurrently simulates zero-shot inference. We apply perceptual constraints on three aspects (content, style, and speaker) to drive this process. Experiments demonstrate that MTCR-VC outperforms previous zero-shot VC methods in modeling speaker timbre while maintaining good speech naturalness.