Neural text-to-speech (TTS) systems systematically mispronounce low-resource proper nouns, particularly non-English names, brands, and geographic locations, due to their underrepresentation in predominantly English training corpora. Existing solutions typically rely on expensive multilingual data collection, supervised finetuning, or manual phonetic annotation, which limits the deployment of TTS systems in linguistically diverse settings. We introduce SonoEdit, a model editing technique that surgically corrects pronunciation errors in pre-trained TTS models without retraining. Instead of costly finetuning or explicit phoneme injection, we propose a parsimonious alternative based on Null-Space Pronunciation Editing, which performs a single-shot parameter update to modify the pronunciation of specific words while provably preserving all other model behavior. We first adapt Acoustic Causal Tracing to identify the Transformer layers responsible for text-to-pronunciation mapping. We then apply Null-Space Constrained Editing to compute a closed-form weight update that corrects the target pronunciation while remaining mathematically orthogonal to the subspace governing general speech generation. This constrained update steers the model's acoustic output toward a desired pronunciation exemplar while guaranteeing zero first-order change on a preserved speech corpus.
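The null-space constraint described in the abstract can be illustrated on a single linear layer. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact formula, so this uses a generic rank-one edit projected onto the null space of a set of preserved inputs. The names `null_space_projector`, `null_space_edit`, `K0`, `k_edit`, and `v_target` are hypothetical, and random matrices stand in for real TTS activations. For a purely linear layer the preserved outputs are unchanged exactly; in a full nonlinear model the same construction would give only the zero first-order change that the abstract claims.

```python
import numpy as np


def null_space_projector(K0, rtol=1e-10):
    """Return the projector onto vectors orthogonal to every column of K0.

    K0 has shape (d_in, n): one column per preserved input ("key").
    """
    U, s, _ = np.linalg.svd(K0, full_matrices=False)
    rank = int(np.sum(s > rtol * s.max()))
    U_r = U[:, :rank]  # orthonormal basis of the span of the preserved keys
    return np.eye(K0.shape[0]) - U_r @ U_r.T


def null_space_edit(W, K0, k_edit, v_target):
    """Closed-form rank-one update so that W_new @ k_edit == v_target
    while W_new @ K0 == W @ K0 (assumed setup, single linear layer)."""
    P = null_space_projector(K0)
    residual = v_target - W @ k_edit  # what the layer currently gets wrong
    pk = P @ k_edit                   # part of the edit key outside the preserved span
    denom = k_edit @ pk
    if denom < 1e-12:
        raise ValueError("edit key lies inside the preserved subspace")
    delta = np.outer(residual, pk) / denom
    return W + delta


# Toy dimensions and random data, purely illustrative.
rng = np.random.default_rng(0)
d_in, d_out, n_preserved = 64, 32, 40
W = rng.normal(size=(d_out, d_in))
K0 = rng.normal(size=(d_in, n_preserved))  # stand-in for the preserved speech corpus
k_edit = rng.normal(size=d_in)             # stand-in for the mispronounced word's key
v_target = rng.normal(size=d_out)          # stand-in for the desired acoustic target

W_new = null_space_edit(W, K0, k_edit, v_target)
```

Because the update has the form `residual ⊗ (P k)`, multiplying it by any preserved key gives zero, while multiplying it by the edit key recovers the full residual. The sketch requires fewer preserved directions than input dimensions (`n_preserved < d_in` here); otherwise the null space is empty and no such edit exists.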