In automatic speech recognition, any factor that alters the acoustic properties of speech can degrade system performance. This paper presents a novel approach to automatic recognition of whispered speech in the Irish dialect using the self-supervised WavLM model. Conventional automatic speech recognition systems often fail to recognise whispered speech accurately because of its distinct acoustic properties and the scarcity of relevant training data. To address this challenge, we fine-tuned a pre-trained WavLM model on a combination of whispered and normal speech from the wTIMIT and CHAINS datasets, which contain English speech in Singaporean and Irish dialects, respectively. A baseline evaluation with the OpenAI Whisper model highlighted its limitations: it achieved a Word Error Rate (WER) of 18.8% and a Character Error Rate (CER) of 4.24% on whispered speech. In contrast, the proposed WavLM-based system substantially improved performance, achieving a WER of 9.22% and a CER of 2.59%. These results demonstrate the efficacy of our approach in recognising whispered speech and underscore the importance of tailored acoustic modelling for robust automatic speech recognition. This study provides insights into developing effective automatic speech recognition solutions for speech made challenging by whispering and dialect. The source code for this paper is freely available.
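For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein (edit) distance between reference and hypothesis, divided by the reference length, counted in words or in characters. It is not the authors' evaluation code. The function names, the whitespace tokenisation, and the absence of text normalisation are assumptions made for illustration. The final helper only restates the reported figures as relative reductions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: plain whitespace tokenisation
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which the proposed error rate undercuts the baseline."""
    return (baseline - proposed) / baseline


if __name__ == "__main__":
    # Toy strings, invented for illustration (not taken from wTIMIT or CHAINS).
    print(wer("a b c d", "a x c"))   # one substitution + one deletion over 4 words
    print(cer("abcd", "abxd"))       # one substitution over 4 characters

    # Error rates reported in the abstract (percent).
    print(f"relative WER reduction: {relative_reduction(18.8, 9.22):.1%}")
    print(f"relative CER reduction: {relative_reduction(4.24, 2.59):.1%}")
```

With the reported figures, the helper gives a relative WER reduction of roughly 51% and a relative CER reduction of roughly 39% against the Whisper baseline. Real evaluations usually normalise case and punctuation before scoring, and that choice can shift both metrics.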