Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems suffer from complex preprocessing pipelines and a reliance on explicit external temporal alignment. To address these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific fine-tuning and an optimized inference procedure, which internalizes speech-text alignment while ensuring high consistency between the original and edited speech. By fine-tuning on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Experiments on the RealEdit benchmark show that CosyEdit not only outperforms several billion-parameter language model baselines but also matches the performance of state-of-the-art cascade approaches. These results demonstrate that, with task-specific fine-tuning and inference optimization, robust and efficient speech editing capabilities can be unlocked from a zero-shot TTS model, yielding a novel and cost-effective end-to-end solution for high-quality speech editing.