Text-based speech editing lets users modify spoken content by editing the transcript, and demands that the edited segments blend imperceptibly into the surrounding context. Prevalent methods that operate in the acoustic space suffer from inherent content-style entanglement, which leads to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of "Edit Content, Preserve Acoustics". Our approach rests on two core components: (1) Structural Foundations, which confines editing to a stable semantic space and delegates acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which applies Group Relative Policy Optimization (GRPO) driven by novel self-consistency rewards. By leveraging a pre-trained Text-to-Speech model as an implicit critic, complemented by strict intelligibility and duration constraints, we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.
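As a concrete reference for the decoding stage, the sketch below states the standard conditional flow matching objective under the linear-interpolation path; the conditioning variable $c$, which we take to stand for the edited semantic tokens together with the surrounding acoustic context, is our assumption about how the decoder is conditioned, not a detail stated above.

\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{data}}} \Big\| v_\theta(x_t, t \mid c) - (x_1 - x_0) \Big\|^2, \qquad x_t = (1 - t)\, x_0 + t\, x_1.
\]

Here $v_\theta$ is the learned velocity field; at inference, integrating $\mathrm{d}x/\mathrm{d}t = v_\theta(x, t \mid c)$ from $t = 0$ to $t = 1$ reconstructs the acoustic features for the edited region.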
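To make the Perceptual Alignment step concrete, the sketch below pairs the standard GRPO group-normalized advantage with one plausible composition of the rewards named above; the weights $\lambda_{\mathrm{wer}}$ and $\lambda_{\mathrm{dur}}$ and the exact penalty forms are illustrative assumptions, not values from the paper.

\[
r_i = r^{\mathrm{TTS}}_i - \lambda_{\mathrm{wer}}\, \mathrm{WER}(y_i) - \lambda_{\mathrm{dur}}\, \big| d(y_i) - d^{\ast} \big|, \qquad \hat{A}_i = \frac{r_i - \mathrm{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{r_j\}_{j=1}^{G}\big)},
\]

where $y_1, \ldots, y_G$ are candidate edited semantic token sequences sampled for one transcript edit, $r^{\mathrm{TTS}}_i$ is the score the pre-trained Text-to-Speech critic assigns to candidate $i$, $\mathrm{WER}$ enforces intelligibility, and $d^{\ast}$ is the target duration. The normalized advantages $\hat{A}_i$ then drive the usual clipped policy-gradient update over the group.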