Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages such as Indonesian, where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio: repetitions and prolongations are injected into fluent text through a combination of rule-based transformations and large language models, and the modified transcripts are then rendered with text-to-speech synthesis. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model via transfer learning, enabling the model to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments, validating the utility of synthetic data pipelines for developing more inclusive speech technologies in under-represented languages.
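The rule-based half of the text-level augmentation can be illustrated with a minimal sketch. The exact injection rules, probabilities, and markers below are assumptions for illustration, not the paper's actual scheme: part-word repetitions are rendered by repeating a leading chunk of a word, and prolongations by stretching its first vowel, each applied to a word with some probability.

```python
import random

def inject_repetition(word):
    # Part-word repetition: repeat a leading syllable-like chunk,
    # e.g. "makan" -> "ma- ma- makan" (illustrative rule, not the paper's exact scheme)
    head = word[:2] if len(word) > 2 else word
    return f"{head}- {head}- {word}"

def inject_prolongation(word):
    # Prolongation: stretch the first vowel, e.g. "saya" -> "saaaaya"
    vowels = "aiueo"
    for i, ch in enumerate(word):
        if ch.lower() in vowels:
            return word[:i] + ch * 4 + word[i + 1:]
    return word  # no vowel found; leave the word unchanged

def make_stuttered(text, p=0.3, seed=0):
    """Apply a repetition or prolongation to each word with probability p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < p:
            op = rng.choice([inject_repetition, inject_prolongation])
            out.append(op(word))
        else:
            out.append(word)
    return " ".join(out)

# Example on an Indonesian sentence ("I want to eat fried rice")
print(make_stuttered("saya mau makan nasi goreng", p=0.5, seed=1))
```

In the full pipeline described above, such rule-modified transcripts (optionally refined by a large language model) would then be passed to a text-to-speech system to produce the synthetic stuttered audio used for fine-tuning.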