SonoEdit: Null-Space Constrained Knowledge Editing for Pronunciation Correction in LLM-Based TTS

Neural text-to-speech (TTS) systems systematically mispronounce low-resource proper nouns, particularly non-English names, brands, and geographic locations, due to their underrepresentation in predominantly English training corpora. Existing solutions typically rely on expensive multilingual data collection, supervised finetuning, or manual phonetic annotation, which limits the deployment of TTS systems in linguistically diverse settings. We introduce SonoEdit, a model editing technique that surgically corrects pronunciation errors in pre-trained TTS models without retraining. Instead of costly finetuning or explicit phoneme injection, we propose a parsimonious alternative based on Null-Space Pronunciation Editing, which performs a single-shot parameter update to modify the pronunciation of specific words while provably preserving all other model behavior. We first adapt Acoustic Causal Tracing to identify the Transformer layers responsible for text-to-pronunciation mapping. We then apply Null-Space Constrained Editing to compute a closed-form weight update that corrects the target pronunciation while remaining mathematically orthogonal to the subspace governing general speech generation. This constrained update steers the model's acoustic output toward a desired pronunciation exemplar while guaranteeing zero first-order change on a preserved speech corpus.
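The null-space constraint described above can be illustrated on a single linear layer. The sketch below is not the SonoEdit implementation; it is a generic projected rank-one update, written under stated assumptions: the edited layer acts as a linear map `W`, the preserved speech corpus is summarized by a matrix `K0` whose columns are that layer's input activations ("keys"), and the desired pronunciation is represented by a target output vector `v_star` for the key `k_star` of the word being corrected. All names (`null_space_projector`, `null_space_edit`, `K0`, `k_star`, `v_star`) are hypothetical and the data is random.

```python
import numpy as np


def null_space_projector(K0, rcond=1e-10):
    """Return P = I - U U^T, the projector onto the orthogonal complement of span(K0).

    K0 has shape (d_in, n); each column is one preserved key (assumed layout).
    """
    U, s, _ = np.linalg.svd(K0, full_matrices=False)
    rank = int(np.sum(s > rcond * s.max()))  # numerical rank of K0
    U_r = U[:, :rank]                        # orthonormal basis of span(K0)
    return np.eye(K0.shape[0]) - U_r @ U_r.T


def null_space_edit(W, K0, k_star, v_star):
    """Closed-form rank-one update dW with dW @ K0 = 0 and (W + dW) @ k_star = v_star."""
    P = null_space_projector(K0)
    pk = P @ k_star                # part of the target key outside span(K0)
    denom = float(k_star @ pk)     # equals ||P k_star||^2 because P is a projector
    if denom < 1e-12:
        raise ValueError("k_star lies in the preserved subspace; no null-space edit exists")
    residual = v_star - W @ k_star  # what the layer output is currently missing
    return np.outer(residual, pk) / denom


# Toy example with random data (illustrative only, not a TTS model).
rng = np.random.default_rng(0)
d_in, d_out, n_preserved = 16, 8, 5
W = rng.normal(size=(d_out, d_in))
K0 = rng.normal(size=(d_in, n_preserved))
k_star = rng.normal(size=d_in)
v_star = rng.normal(size=d_out)

delta_W = null_space_edit(W, K0, k_star, v_star)
W_new = W + delta_W

print("max change on preserved keys:", np.abs(W_new @ K0 - W @ K0).max())
print("max error on edited key:", np.abs(W_new @ k_star - v_star).max())
```

Because the update's row space lies in the null space of the preserved keys, the layer's output on those keys is unchanged up to floating-point error, while the edited key maps to the target value. In a full nonlinear model, an exact guarantee at one layer of this kind corresponds to the zero first-order change on the preserved corpus that the abstract refers to.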

