One of the major challenges in developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a labeled Bangla speech dataset of more than 20k hours covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then used the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR system on publicly available datasets and compared it with other available models. To investigate its efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data, among others. Our results demonstrate the efficacy of the model trained on pseudo-labeled data on the designed test set as well as on publicly available Bangla datasets. The experimental resources will be made publicly available (https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR).