RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models

Modern voice cloning, also known as zero-shot text-to-speech (TTS), can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practice, these systems often face noisy reference audio, imperfect text prompts, multilingual and long-form generation, post-processing, and adversarial perturbations, all of which can weaken robustness. Despite rapid progress in codec-token language models and diffusion-based TTS, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive dataset and benchmark for evaluating robustness in voice cloning. RVCBench provides task-aligned tests covering controlled text-audio pairing, multilingual and long-form scenarios, expressive prompts, post-processing conditions, and passive or proactive audio perturbations. Across 18 robustness evaluations, 225 speakers, and 14,370 utterances, RVCBench supports unified evaluation of input sensitivity, generation stability, output resilience, perturbation robustness, speaker similarity, and deepfake detectability. We evaluate 18 representative open-source voice cloning models and reveal systematic vulnerabilities in content consistency, speaker similarity, long-form stability, post-processing resilience, adversarial robustness, and detector-facing separability. We release the code and dataset to support reproducible evaluation and future research on robust voice cloning, speech synthesis, and audio generation. Code: https://github.com/Nanboy-Ronan/RVCBench. Dataset: https://huggingface.co/datasets/Nanboy/RVCBench.
