We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further improve robustness and control, the model can leverage continuous or quantized linguistic features to enhance intelligibility and speaker similarity, and can use or omit the pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the highest target-speaker and accent similarity while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
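The abstract mentions combining multiple classifier-free guidances at inference. As a minimal sketch, the standard multi-condition CFG formulation blends an unconditional prediction with several conditional ones, one guidance weight per condition; the function name `multi_cfg` and the exact weighting scheme are illustrative assumptions, not MaskVCT's actual implementation.

```python
# Hedged sketch of multi-condition classifier-free guidance (CFG).
# This is the generic formulation, not MaskVCT's exact inference code.
import numpy as np

def multi_cfg(pred_uncond, pred_conds, weights):
    """Blend an unconditional model output with several conditional ones.

    pred_uncond: output with all conditions dropped
    pred_conds:  list of outputs, each with one condition enabled
                 (e.g. speaker embedding, linguistic features, pitch contour)
    weights:     guidance scale per condition; 0.0 effectively omits it
    """
    uncond = np.asarray(pred_uncond, dtype=float)
    out = uncond.copy()
    for pred_c, w in zip(pred_conds, weights):
        # Each condition pushes the output along its own guidance direction.
        out += w * (np.asarray(pred_c, dtype=float) - uncond)
    return out
```

Setting a condition's weight to zero removes its influence entirely, which mirrors how the abstract describes omitting the pitch contour to free up prosody control.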