E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis

Recent advances in speech synthesis have enriched daily life, and high-quality, human-like audio is now widely used in real-world applications. However, malicious exploitation such as voice-cloning fraud poses severe security risks. Existing defense techniques struggle against production large language model (LLM)-based speech synthesis. Although previous studies have considered protection against synthesizer fine-tuning, they assume manually annotated transcripts. Because manual annotation is labor-intensive, end-to-end (E2E) systems that use automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. E2E speech synthesis therefore also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework against two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ an encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate a psychoacoustic model to keep the perturbation imperceptible. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs on Chinese and English datasets, confirming E2E-VGuard's effectiveness in protecting both timbre and pronunciation. We also validate the framework in a real-world deployment. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.

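The timbre-protection idea in the abstract (perturbing audio so that an ensemble of encoders no longer extracts the original speaker features) can be illustrated with a toy sketch. This is not the authors' implementation: the random linear maps stand in for real speaker encoders, a simple L-infinity bound `eps` is swapped in for the psychoacoustic constraint, the ASR-targeted loss is omitted, and the names `protect` and `ensemble_distance` are hypothetical.

```python
import numpy as np

def ensemble_distance(x, x_adv, encoders):
    """Mean squared feature distance between x and x_adv over all encoders."""
    return float(np.mean([np.sum((W @ x_adv - W @ x) ** 2) for W in encoders]))

def protect(x, encoders, eps=0.01, step=0.002, iters=50, seed=0):
    """Sign-gradient ascent on the ensemble feature distance.

    Assumption: each encoder is a linear map W, so the gradient of
    ||W(x + delta) - W x||^2 with respect to delta is 2 * W.T @ (W @ delta).
    The L-infinity clip is a stand-in for a psychoacoustic masking constraint.
    """
    rng = np.random.default_rng(seed)
    # Random start: the gradient is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        grad = np.zeros_like(x)
        for W in encoders:
            grad += 2.0 * W.T @ (W @ delta)
        grad /= len(encoders)
        # Step uphill, then project back into the allowed perturbation box.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return x + delta
```

A possible usage, with a random vector standing in for a waveform: build `encoders = [rng.standard_normal((32, 256)) for _ in range(3)]`, call `x_adv = protect(x, encoders)`, and compare `ensemble_distance(x, x_adv, encoders)` against the perturbation size `np.max(np.abs(x_adv - x))`. Averaging the gradient across several encoders is the usual motivation for an ensemble: the perturbation is less likely to overfit to a single model.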