This paper presents an end-to-end deep learning model for Automatic Speech Recognition (ASR) that transcribes Nepali speech to text. The model was trained and tested on the OpenSLR (audio, text) dataset. Most of the audio samples in the dataset have silent gaps at both ends, which are clipped during preprocessing to give a more uniform mapping between audio frames and their corresponding text. Mel Frequency Cepstral Coefficients (MFCCs) are used as the audio features fed into the model. Of all the models trained so far (neural networks with variations of LSTM, GRU, CNN, and ResNet), the one pairing a bidirectional LSTM with ResNet blocks and one-dimensional CNNs produces the best results on this dataset. This novel model uses the Connectionist Temporal Classification (CTC) loss function during training and CTC beam search decoding to predict the most likely sequence of Nepali characters. On the test dataset, a character error rate (CER) of 17.06 percent was achieved. The source code is available at: https://github.com/manishdhakal/ASR-Nepali-using-CNN-BiLSTM-ResNet.
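The sketch below illustrates, under stated assumptions, the pipeline the abstract describes: silence-trimmed audio converted to MFCC features, a 1-D CNN front end with a ResNet-style skip connection, a bidirectional LSTM encoder, and a CTC objective. It is not the authors' implementation (that is in the linked repository); layer sizes, kernel widths, the vocabulary size, and the specific librosa/TensorFlow calls are illustrative choices.

```python
# Minimal sketch of an MFCC -> Conv1D/ResNet -> BiLSTM -> CTC acoustic model.
# All hyperparameters here are assumptions, not the paper's reported settings.
import librosa
import tensorflow as tf
from tensorflow.keras import layers


def extract_mfcc(path, sr=16000, n_mfcc=13):
    """Load an utterance, trim leading/trailing silence, return MFCC frames."""
    audio, _ = librosa.load(path, sr=sr)
    audio, _ = librosa.effects.trim(audio)   # clip silent gaps at both ends
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                             # shape: (time_steps, n_mfcc)


def residual_block(x, filters, kernel_size=7):
    """ResNet-style block of 1-D convolutions with a skip connection."""
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)
    x = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.Add()([shortcut, x])
    return layers.Activation("relu")(x)


def build_model(n_mfcc=13, vocab_size=70):
    """CNN + ResNet front end and BiLSTM encoder; last output unit is the CTC blank."""
    inputs = layers.Input(shape=(None, n_mfcc))
    x = layers.Conv1D(128, kernel_size=11, padding="same", activation="relu")(inputs)
    x = residual_block(x, 128)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_model()
# Training would use a CTC objective (e.g. tf.nn.ctc_loss or
# tf.keras.backend.ctc_batch_cost), and inference would decode the per-frame
# character distributions with tf.nn.ctc_beam_search_decoder.
```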