Synthetic accented speech is a promising way to improve automatic speech recognition (ASR) when real accented recordings are scarce. We ask what makes such data useful for ASR fine-tuning: target-accent phoneme edits that expose the recognizer to accent-specific pronunciations, or random phoneme perturbations that act as augmentation in phoneme space. In a few-shot TTS pipeline, we compare LLM-generated accent edits with matched-rate random substitutions and oracle controls using ground-truth accented phonemes and prosody. Random substitutions recover much of the ASR gain: LLM target-accent edits improve over random by only a small margin, ground-truth phonemes stay close to the random baseline and nearly converge with it as the synthetic ASR fine-tuning set grows larger, and adding ground-truth prosody yields only a modest further gain. Mixing synthetic with real accented speech also stabilizes low-resource fine-tuning, but a fixed synthetic budget can later dilute the information in real data, showing that the real--synthetic ratio matters.
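The matched-rate random baseline can be illustrated with a minimal sketch. It assumes that "matched rate" means applying the same number of substitutions per utterance as the target-accent edit; the abstract does not specify the matching procedure. The phoneme inventory and the function names `count_substitutions` and `random_matched_substitutions` are hypothetical and not taken from the paper.

```python
import random

# Hypothetical ARPAbet-style inventory (assumption; the source does not give one).
PHONEMES = ["AA", "AE", "AH", "B", "D", "DH", "EH", "IY", "K", "S", "T", "TH", "Z"]


def count_substitutions(original, edited):
    """Count positions where two equal-length phoneme sequences differ."""
    if len(original) != len(edited):
        raise ValueError("this sketch handles substitutions only, not insertions or deletions")
    return sum(a != b for a, b in zip(original, edited))


def random_matched_substitutions(original, n_edits, inventory=PHONEMES, rng=None):
    """Replace n_edits randomly chosen positions with a different random phoneme."""
    rng = rng or random.Random()
    n_edits = min(n_edits, len(original))
    # Sample distinct positions so exactly n_edits phonemes change.
    positions = rng.sample(range(len(original)), n_edits)
    perturbed = list(original)
    for i in positions:
        # Exclude the original phoneme so every chosen position really changes.
        choices = [p for p in inventory if p != original[i]]
        perturbed[i] = rng.choice(choices)
    return perturbed


# Illustrative sequences (made up for this sketch, not data from the paper).
original = ["DH", "AH", "K", "AE", "T"]
accent_edit = ["D", "AH", "K", "EH", "T"]

n = count_substitutions(original, accent_edit)
random_edit = random_matched_substitutions(original, n, rng=random.Random(0))
```

By construction, `random_edit` differs from `original` in exactly as many positions as `accent_edit` does, so the two conditions differ in which phonemes are changed and to what, not in how many are changed.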