Despite rapid advances in Automatic Speech Recognition (ASR), developing robust models for underrepresented languages such as Nepali remains a challenge. This research curates an exhaustive and generalized dataset and fine-tunes OpenAI's Whisper models of different sizes to improve transcription (speech-to-text) accuracy for the Nepali language. We combine publicly available ASR datasets with self-recorded custom datasets covering a diverse range of accents, dialects, and speaking styles, further enriched through augmentation. Our experimental results demonstrate that fine-tuning Whisper models on our curated custom dataset substantially reduces the Word Error Rate (WER) across all model sizes. We attribute these gains to greater data variation in speaker age, gender, and sentiment, acoustic environment, and dialect; to denser audio segments (15-30 seconds) that better match Whisper's input window; and to manual curation of audio and transcriptions. Notably, our approach outperforms Whisper's baseline models on the FLEURS dataset, achieving WER reductions of up to 36.2% for the small model and 23.8% for the medium model. Furthermore, we show that data augmentation plays a significant role in enhancing model robustness. Our approach underlines the importance of dataset quality, variation, and augmentation in adapting state-of-the-art models to underrepresented languages to develop accurate ASR systems.