Neural codec language models achieve impressive zero-shot Text-to-Speech (TTS) by fully imitating the acoustic characteristics of a short speech prompt, including timbre, prosody, and paralinguistic information. However, such holistic imitation limits their ability to isolate and control individual attributes. In this paper, we present SpeechEdit, a unified codec language model that extends zero-shot TTS with a selective control mechanism. By default, SpeechEdit reproduces the complete acoustic profile inferred from the speech prompt, but it selectively overrides only the attributes specified by explicit control instructions. To enable controllable modeling, SpeechEdit is trained on our newly constructed LibriEdit dataset, which provides delta (difference-aware) training pairs derived from LibriHeavy. Experimental results show that our approach maintains naturalness and robustness while offering flexible, localized control over the desired attributes. Audio samples are available at https://speech-editing.github.io/speech-editing/.