Despite rapid advances in Automatic Speech Recognition (ASR), developing robust models for underrepresented languages such as Nepali remains a challenge. This research curates an exhaustive and generalized dataset and fine-tunes OpenAI's Whisper models of different sizes to improve transcription (speech-to-text) accuracy for the Nepali language. We combine publicly available ASR datasets with self-recorded custom datasets covering a diverse range of accents, dialects, and speaking styles, further enriched through augmentation. Our experimental results demonstrate that fine-tuning Whisper models on our curated custom dataset substantially reduces the Word Error Rate (WER) across all model sizes. We attribute these gains to greater data variation in speaker age, gender, and sentiment, acoustic environment, and dialect; to denser audio segments (15-30 seconds) that better match Whisper's input window; and to manual curation of audio and transcriptions. Notably, our approach outperforms Whisper's baseline models on the FLEURS dataset, achieving WER reductions of up to 36.2% for the small model and 23.8% for the medium model. Furthermore, we show that data augmentation plays a significant role in enhancing model robustness. Our approach underlines the importance of dataset quality, variation, and augmentation in adapting state-of-the-art models to underrepresented languages to develop accurate ASR systems.