Despite major advances in Automatic Speech Recognition (ASR), state-of-the-art ASR systems still struggle with impaired speech, even in high-resource languages. In Arabic, the challenge is amplified by the added difficulty of collecting data from dysarthric speakers. In this paper, we aim to improve the performance of Arabic dysarthric speech recognition through a multi-stage augmentation approach. To this end, we first propose a signal-based approach that generates dysarthric Arabic speech from healthy Arabic speech by modifying its speed and tempo. As a second stage, we propose a Parallel Wave Generative (PWG) adversarial model trained on an English dysarthric dataset to capture language-independent dysarthric speech patterns and further augment the signal-adjusted speech samples. Furthermore, we propose fine-tuning and text-correction strategies for an Arabic Conformer at different dysarthric speech severity levels. Our fine-tuned Conformer achieved 18% Word Error Rate (WER) and 17.2% Character Error Rate (CER) on synthetically generated dysarthric speech from the Arabic Common Voice speech dataset. This is a significant WER improvement of 81.8% compared to the baseline model trained solely on healthy data. We perform further validation on real English dysarthric speech, showing a WER improvement of 124% compared to the baseline trained only on the healthy English LJSpeech dataset.
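
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how WER and CER are commonly computed from a reference transcript and an ASR hypothesis, using Levenshtein edit distance. It is not the evaluation code used in this work. The function names are hypothetical, and the choice to ignore spaces when computing CER is an assumption, since conventions differ between toolkits.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, with spaces ignored (an assumed convention)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

Both metrics can exceed 100% when the hypothesis contains many insertions, because the denominator is the length of the reference only.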