Investigating the effect of domain selection on automatic speech recognition performance: a case study on Bangladeshi Bangla

The performance of data-driven natural language processing systems is contingent upon the quality of corpora. However, principal corpus design criteria are often not identified and examined adequately, particularly in the speech processing discipline. Speech corpus development requires additional attention to factors such as clean versus noisy conditions, read versus spontaneous speech, multi-talker speech, and accents or dialects. Domain selection is also a crucial decision point in speech corpus development. In this study, we demonstrate the significance of domain selection by assessing a state-of-the-art Bangla automatic speech recognition (ASR) model on BanSpeech, a novel multi-domain Bangladeshi Bangla ASR evaluation benchmark that contains 7.2 hours of speech and 9802 utterances from 19 distinct domains. The ASR model was trained with a deep convolutional neural network (CNN), layer normalization, and the Connectionist Temporal Classification (CTC) loss criterion on SUBAK.KO, a mostly read speech corpus for Bangla, a low-resource and morphologically rich language. Experimental evaluation reveals that the ASR model trained on SUBAK.KO has difficulty recognizing speech from domains that consist mostly of spontaneous speech and contain a high number of out-of-vocabulary (OOV) words. The same model, on the other hand, performs better on read speech domains that contain fewer OOV words. In addition, we report the outcomes of our experiments with layer normalization, input feature extraction, the number of convolutional layers, etc., and set a baseline on SUBAK.KO. BanSpeech will be made publicly available to meet the need for a challenging evaluation benchmark for Bangla ASR.
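As an illustration of the per-domain OOV analysis mentioned above, the following minimal sketch computes an OOV rate for each evaluation domain against a training vocabulary. It assumes whitespace tokenization and plain-text transcripts. The function names (`build_vocab`, `oov_rate`, `oov_rate_by_domain`) and the data layout are hypothetical and are not taken from the SUBAK.KO or BanSpeech tooling, and the exact OOV definition used in the study may differ.

```python
def build_vocab(train_transcripts):
    """Collect the set of word types seen in the training transcripts.

    Assumption: words are separated by whitespace.
    """
    vocab = set()
    for line in train_transcripts:
        vocab.update(line.split())
    return vocab


def oov_rate(vocab, eval_transcripts):
    """Fraction of evaluation word tokens that are absent from the vocabulary."""
    total = 0
    oov = 0
    for line in eval_transcripts:
        for token in line.split():
            total += 1
            if token not in vocab:
                oov += 1
    # Return 0.0 for an empty evaluation set to avoid division by zero.
    return oov / total if total else 0.0


def oov_rate_by_domain(vocab, transcripts_by_domain):
    """Map each domain name (hypothetical keys) to its token-level OOV rate."""
    return {
        domain: oov_rate(vocab, transcripts)
        for domain, transcripts in transcripts_by_domain.items()
    }
```

Comparing these per-domain rates with per-domain word error rates is one way to check whether the domains that are harder to recognize are also the ones with more unseen words.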
