Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only a few seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, VC models inevitably encounter noisy reference audio, imperfect text prompts, and diverse downstream processing, all of which can significantly degrade performance. Despite rapid progress in VC driven by autoregressive codec-token language models and diffusion-based models, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive benchmark that evaluates Robustness in VC across the full generation pipeline: input variation, generation challenges, output post-processing, and adversarial perturbations. It covers 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative modern VC models. Our evaluation uncovers substantial robustness gaps in VC: performance can deteriorate sharply under common input shifts and post-processing; long-context and cross-lingual scenarios further expose stability limitations; and both passive noise and proactive perturbations affect generation robustness. Collectively, these findings provide a unified picture of how current VC models fail in practice, and RVCBench offers a standardized, open-source testbed to support the development of more robust and deployable VC models. We open-source our project at https://github.com/Nanboy-Ronan/RVCBench.