Zero-shot voice conversion (VC) converts source speech into the voice of an arbitrary target speaker using only a single utterance of that speaker, without requiring additional model updates. Typical methods achieve zero-shot VC by using a speaker representation from a pre-trained speaker verification (SV) model or by learning the speaker representation during VC training. However, existing speaker modeling methods overlook how the richness of speaker information varies across the temporal and frequency-channel dimensions of speech. This insufficient speaker modeling hampers the ability of the VC model to accurately represent unseen speakers that are not in the training dataset. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to adapt flexibly to dynamically varying speaker characteristics along the temporal and channel axes of speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), which locates when and where speaker information appears in speech. It retrieves variable-length speaker representations from both the temporal and channel dimensions under the guidance of a pre-trained SV model. In addition, inspired by the hierarchical process of human speech production, the MTCR speaker module stacks several TCR blocks to extract speaker representations at multiple levels of granularity. Furthermore, to achieve better speech disentanglement and reconstruction, we introduce a cycle-based training strategy that recurrently simulates zero-shot inference. We adopt perceptual constraints on three aspects, including content, style, and speaker, to drive this process. Experiments demonstrate that MTCR-VC outperforms previous zero-shot VC methods in modeling speaker timbre while maintaining good speech naturalness.
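To make the retrieval idea concrete, the following is a minimal PyTorch sketch of one possible temporal-channel retrieval block, assuming frame-level encoder features of shape (B, T, C) and an utterance-level embedding from a pre-trained SV model. All class, dimension, and variable names here (TCRBlock, feat_dim, sv_dim, out_dim) are hypothetical illustrations under these assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a temporal-channel retrieval (TCR) block.
# Names and layer choices are illustrative, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TCRBlock(nn.Module):
    """Retrieves a speaker representation from speech features by weighting the
    temporal and channel dimensions, guided by a pre-trained SV embedding."""

    def __init__(self, feat_dim: int, sv_dim: int, out_dim: int):
        super().__init__()
        self.temporal_query = nn.Linear(sv_dim, feat_dim)  # scores frame relevance
        self.channel_gate = nn.Linear(sv_dim, feat_dim)    # scores channel relevance
        self.proj = nn.Linear(feat_dim, out_dim)

    def forward(self, feats: torch.Tensor, sv_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C) frame-level speech features
        # sv_emb: (B, sv_dim) utterance-level embedding from a pre-trained SV model
        query = self.temporal_query(sv_emb)                      # (B, C)
        scores = torch.einsum("btc,bc->bt", feats, query)        # (B, T)
        t_weights = F.softmax(scores / feats.size(-1) ** 0.5, dim=-1)
        pooled = torch.einsum("bt,btc->bc", t_weights, feats)    # temporal retrieval
        c_weights = torch.sigmoid(self.channel_gate(sv_emb))     # channel retrieval
        return self.proj(pooled * c_weights)                     # (B, out_dim)


# Multi-level stacking: one speaker vector per granularity level.
blocks = nn.ModuleList(TCRBlock(feat_dim=256, sv_dim=192, out_dim=128) for _ in range(3))
feats = torch.randn(2, 120, 256)   # dummy encoder features (B, T, C)
sv_emb = torch.randn(2, 192)       # dummy SV embedding
speaker_reps = [blk(feats, sv_emb) for blk in blocks]  # list of (B, 128) vectors
```

In the multi-level setting described above, such a block would plausibly be applied to encoder features at several depths, so that each level contributes a speaker vector of its own granularity, echoing the hierarchical view of speech production.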