Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
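To make the "group relative" part of GRPO concrete, the following is a minimal sketch of how per-sample advantages are commonly computed within one group of sampled outputs: each reward is normalized by the group's mean and standard deviation, so no learned value model is needed. The function name, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions for illustration; the abstract does not specify CosyEdit2's reward design or its exact normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style sketch).

    rewards: scalar rewards for several candidate outputs generated
             from the same prompt (hypothetical values in this sketch).
    eps:     small constant (an assumption) that avoids division by zero
             when every candidate receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumed choice)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three edited-speech candidates of one prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Candidates that score above the group mean receive a positive advantage and are reinforced, while below-average candidates are discouraged. Because the signal is relative within a group, it can be computed from reward scores alone, which is consistent with training on data that has no target speech.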