This paper presents an end-to-end deep learning model for Automatic Speech Recognition (ASR) that transcribes Nepali speech to text. The model was trained and tested on the OpenSLR (audio, text) dataset. Most of the audio samples in the dataset have silent gaps at both ends, which are clipped during preprocessing to give a more uniform mapping between audio frames and their corresponding text. Mel Frequency Cepstral Coefficients (MFCCs) are used as the audio features fed into the model. Of all the models trained so far (neural networks with variations of LSTM, GRU, CNN, and ResNet), the one pairing a bidirectional LSTM with ResNet blocks and one-dimensional CNNs produces the best results on this dataset. This novel model uses the Connectionist Temporal Classification (CTC) loss function during training and CTC beam search decoding to predict the most likely sequence of Nepali characters. On the test dataset, a character error rate (CER) of 17.06 percent was achieved. The source code is available at: https://github.com/manishdhakal/ASR-Nepali-using-CNN-BiLSTM-ResNet.
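The sketch below illustrates, under stated assumptions, the pipeline the abstract describes: silence-trimmed audio converted to MFCC features, a 1-D CNN front end with a ResNet-style skip connection, a bidirectional LSTM encoder, and a CTC objective. It is not the authors' implementation (that is in the linked repository); layer sizes, kernel widths, the vocabulary size, and the specific librosa/TensorFlow calls are illustrative choices.

```python
# Minimal sketch of an MFCC -> Conv1D/ResNet -> BiLSTM -> CTC acoustic model.
# All hyperparameters here are assumptions, not the paper's reported settings.
import librosa
import tensorflow as tf
from tensorflow.keras import layers


def extract_mfcc(path, sr=16000, n_mfcc=13):
    """Load an utterance, trim leading/trailing silence, return MFCC frames."""
    audio, _ = librosa.load(path, sr=sr)
    audio, _ = librosa.effects.trim(audio)   # clip silent gaps at both ends
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                             # shape: (time_steps, n_mfcc)


def residual_block(x, filters, kernel_size=7):
    """ResNet-style block of 1-D convolutions with a skip connection."""
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)
    x = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.Add()([shortcut, x])
    return layers.Activation("relu")(x)


def build_model(n_mfcc=13, vocab_size=70):
    """CNN + ResNet front end and BiLSTM encoder; last output unit is the CTC blank."""
    inputs = layers.Input(shape=(None, n_mfcc))
    x = layers.Conv1D(128, kernel_size=11, padding="same", activation="relu")(inputs)
    x = residual_block(x, 128)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_model()
# Training would use a CTC objective (e.g. tf.nn.ctc_loss or
# tf.keras.backend.ctc_batch_cost), and inference would decode the per-frame
# character distributions with tf.nn.ctc_beam_search_decoder.
```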